diff mbox series

[net-next,1/4] selftests: forwarding: Test redirecting gre or ipip packets to Ethernet

Message ID 0a4e63cd3cde3c71cfc422a7f0f5e9bc76c0c1f5.1625056665.git.gnault@redhat.com
State New
Headers show
Series selftests: forwarding: Tests redirects between L3 and L2 devices | expand

Commit Message

Guillaume Nault June 30, 2021, 12:51 p.m. UTC
Add selftests for the following commits:
  * aab1e898c26c ("gre: let mac_header point to outer header only when
    necessary"),
  * 7ad136fd288c ("ipip: allow redirecting ipip and mplsip packets to
    eth devices").

Two end hosts ping each other, but they're separated by two routers.
The routers encapsulate the end host packets into gre or ipip packets.
On decap, the routers add an Ethernet header to the inner L3 packet and
forward the resulting frame to the end host.

A new topo_nschain_lib.sh file is created, to make the base network
settings reusable. The actual tests are implemented in
tc_redirect_l2l3.sh.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 .../testing/selftests/net/forwarding/Makefile |   1 +
 tools/testing/selftests/net/forwarding/config |   2 +
 .../net/forwarding/tc_redirect_l2l3.sh        | 287 ++++++++++++++++++
 .../net/forwarding/topo_nschain_lib.sh        | 267 ++++++++++++++++
 4 files changed, 557 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_redirect_l2l3.sh
 create mode 100644 tools/testing/selftests/net/forwarding/topo_nschain_lib.sh

Comments

Ido Schimmel July 1, 2021, 5:46 a.m. UTC | #1
On Wed, Jun 30, 2021 at 02:51:38PM +0200, Guillaume Nault wrote:
> diff --git a/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh b/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh
> new file mode 100644
> index 000000000000..4c0bf2d7328a
> --- /dev/null
> +++ b/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh
> @@ -0,0 +1,267 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +# A chain of 4 nodes connected with veth pairs.
> +# Each node lives in its own network namespace.

Hi,

The tests under tools/testing/selftests/net/forwarding/ are meant to use
VRFs as lightweight namespaces. This allows us to run the tests on both
physical switches with loopback cables and veth pairs, thereby
validating both the hardware and software datapaths.

See tools/testing/selftests/net/forwarding/README

If the tests cannot be converted to VRFs, then I suggest moving them to
tools/testing/selftests/net/

> +# Each veth interface has an IPv4 and an IPv6 address. A host route provides
> +# connectivity to the adjacent node. This base network only allows nodes to
> +# communicate with their immediate neighbours.
> +#
> +# The two nodes at the extremities of the chain also have 4 host IPs on their
> +# loopback device:
> +#   * An IPv4 address, routed as is to the adjacent router.
> +#   * An IPv4 address, routed over MPLS to the adjacent router.
> +#   * An IPv6 address, routed as is to the adjacent router.
> +#   * An IPv6 address, routed over MPLS to the adjacent router.
> +#
> +# This topology doesn't define how RTA and RTB handle these packets: users of
> +# this script are responsible for the plumbing between RTA and RTB.
> +#
> +# As each veth connects two different namespaces, their MAC and IP addresses
> +# are defined depending on the local and remote namespaces. For example
> +# veth-h1-rta, which sits in H1 and links to RTA, has MAC address
> +# 00:00:5e:00:53:1a, IPv4 192.0.2.0x1a and IPv6 2001:db8::1a, where "1a" means
> +# that it's in H1 and links to RTA (the rest of each address is always built
> +# from a IANA documentation prefix).
> +#
> +# Routed addresses in H1 and H2 on the other hand encode the routing type (with
> +# or without MPLS encapsulation) and the namespace the address resides in. For
> +# example H2 has 198.51.100.2 and 2001:db8::1:2, that are routed as is through
> +# RTB. It also has 198.51.100.0x12 and 2001:db8::1:12, that are routed through
> +# RTB with MPLS encapsulation.
> +#
> +# For clarity, the prefixes used for host IPs are different from the ones used
> +# on the veths.
> +#
> +# The MPLS labels follow a similar principle: the first digit represents the
> +# IP version of the encapsulated packet ("4" for IPv4, "6" for IPv6), the
> +# second digit represents the destination host ("1" for H1, "2" for H2).
> +#
> +# +----------------------------------------------------+
> +# |                    Host 1 (H1)                     |
> +# |                                                    |
> +# |   lo                                               |
> +# |     198.51.100.1    (for plain IPv4)               |
> +# |     2001:db8::1:1   (for plain IPv6)               |
> +# |     198.51.100.0x11 (for IPv4 over MPLS, label 42) |
> +# |     2001:db8::1:11  (for IPv6 over MPLS, label 62) |
> +# |                                                    |
> +# | + veth-h1-rta                                      |
> +# | |   192.0.2.0x1a                                   |
> +# | |   2001:db8::1a                                   |
> +# +-|--------------------------------------------------+
> +#   |
> +# +-|--------------------+
> +# | |  Router A (RTA)    |
> +# | |                    |
> +# | + veth-rta-h1        |
> +# |     192.0.2.0xa1     |
> +# |     2001:db8::a1     |
> +# |                      |
> +# | + veth-rta-rtb       |
> +# | |   192.0.2.0xab     |
> +# | |   2001:db8::ab     |
> +# +-|--------------------+
> +#   |
> +# +-|--------------------+
> +# | |  Router B (RTB)    |
> +# | |                    |
> +# | + veth-rtb-rta       |
> +# |     192.0.2.0xba     |
> +# |     2001:db8::ba     |
> +# |                      |
> +# | + veth-rtb-h2        |
> +# | |   192.0.2.0xb2     |
> +# | |   2001:db8::b2     |
> +# +-|--------------------+
> +#   |
> +# +-|--------------------------------------------------+
> +# | |                  Host 2 (H2)                     |
> +# | |                                                  |
> +# | + veth-h2-rtb                                      |
> +# |     192.0.2.0x2b                                   |
> +# |     2001:db8::2b                                   |
> +# |                                                    |
> +# |   lo                                               |
> +# |     198.51.100.2    (for plain IPv4)               |
> +# |     2001:db8::1:2   (for plain IPv6)               |
> +# |     198.51.100.0x12 (for IPv4 over MPLS, label 41) |
> +# |     2001:db8::1:12  (for IPv6 over MPLS, label 61) |
> +# +----------------------------------------------------+
> +#
> +# This topology can be used for testing different routing or switching
> +# scenarios, as H1 and H2 are pre-configured for sending different kinds of
> +# packets (IPv4, IPv6, with or without MPLS encapsulation), which RTA and RTB
> +# can easily match and process according to the forwarding mechanism to test.
Guillaume Nault July 1, 2021, 2:59 p.m. UTC | #2
On Thu, Jul 01, 2021 at 08:46:46AM +0300, Ido Schimmel wrote:
> On Wed, Jun 30, 2021 at 02:51:38PM +0200, Guillaume Nault wrote:
> > diff --git a/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh b/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh
> > new file mode 100644
> > index 000000000000..4c0bf2d7328a
> > --- /dev/null
> > +++ b/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh
> > @@ -0,0 +1,267 @@
> > +#!/bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +# A chain of 4 nodes connected with veth pairs.
> > +# Each node lives in its own network namespace.
> 
> Hi,
> 
> The tests under tools/testing/selftests/net/forwarding/ are meant to use
> VRFs as lightweight namespaces. This allows us to run the tests on both
> physical switches with loopback cables and veth pairs, thereby
> validating both the hardware and software datapaths.
> 
> See tools/testing/selftests/net/forwarding/README

It wasn't clear to me that net/forwarding was _only_ for VRF-based
tests.

> If the tests cannot be converted to VRFs, then I suggest moving them to
> tools/testing/selftests/net/

I first tried to write this selftest using VRFs, but there were some
problems that made me switch to namespaces (I don't remember precisely
which ones, probably virtual tunnel devices in collect_md mode).

I'll give VRF another try. And if I can't get to a satisfying solution,
I'll move the selftest under net/.
David Ahern July 1, 2021, 3:38 p.m. UTC | #3
On 7/1/21 8:59 AM, Guillaume Nault wrote:
> I first tried to write this selftest using VRFs, but there were some
> problems that made me switch to namespaces (I don't remember precisely
> which ones, probably virtual tunnel devices in collect_md mode).

if you hit a problem with the test not working, send me the test script
and I will take a look.
Guillaume Nault July 1, 2021, 4:59 p.m. UTC | #4
On Thu, Jul 01, 2021 at 09:38:44AM -0600, David Ahern wrote:
> On 7/1/21 8:59 AM, Guillaume Nault wrote:
> > I first tried to write this selftest using VRFs, but there were some
> > problems that made me switch to namespaces (I don't remember precisely
> > which ones, probably virtual tunnel devices in collect_md mode).
> 
> if you hit a problem with the test not working, send me the test script
> and I will take a look.

Okay, thanks.
Guillaume Nault July 6, 2021, 7:02 p.m. UTC | #5
On Thu, Jul 01, 2021 at 09:38:44AM -0600, David Ahern wrote:
> On 7/1/21 8:59 AM, Guillaume Nault wrote:
> > I first tried to write this selftest using VRFs, but there were some
> > problems that made me switch to namespaces (I don't remember precisely
> > which ones, probably virtual tunnel devices in collect_md mode).
> 
> if you hit a problem with the test not working, send me the test script
> and I will take a look.

So I've looked again at what it'd take to make a VRF-based selftest.
The problem is that we currently can't create collect_md tunnel
interfaces in different VRFs, if the VRFs are part of the same netns.

Most tunnels explicitely refuse to create a collect_md device if
another one already exists in the netns, no matter the rest of the
tunnel parameters. This is the behaviour of ip_gre, ipip, ip6_gre and
ip6_tunnel.

Then there's sit, which allows the creation of the second collect_md
device in the other VRF. However, iproute2 doesn't set the
IFLA_IPTUN_LINK attribute when it creates an external device, so it
can't set up such a configuration.

Bareudp simply doesn't support VRF.

Finally, vxlan allows devices with different IFLA_VXLAN_LINK attributes
to be created, but only when VXLAN_F_IPV6_LINKLOCAL is set. Removing
the VXLAN_F_IPV6_LINKLOCAL test at the end of vxlan_config_validate()
is enough to make two VXLAN-GPE devices work in a multi-VRF setup:

--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3767,8 +3767,7 @@ static int vxlan_config_validate(struct net *src_net, struct vxlan_config *conf,
 		    (conf->flags & (VXLAN_F_RCV_FLAGS | VXLAN_F_IPV6)))
 			continue;
 
-		if ((conf->flags & VXLAN_F_IPV6_LINKLOCAL) &&
-		    tmp->cfg.remote_ifindex != conf->remote_ifindex)
+		if (tmp->cfg.remote_ifindex != conf->remote_ifindex)
 			continue;
 
 		NL_SET_ERR_MSG(extack,

Here's an example of what a full selftests looks like using VXLAN-GPE.
Without the patch above, creating the second vxlan interface fails
(EEXIST).

#!/bin/bash
# SPDX-License-Identifier: GPL-2.0

NUM_NETIFS=6

source lib.sh

VETH_H1_RTA=${NETIFS[p1]}
VETH_RTA_H1=${NETIFS[p2]}
VETH_RTA_RTB=${NETIFS[p3]}
VETH_RTB_RTA=${NETIFS[p4]}
VETH_RTB_H2=${NETIFS[p5]}
VETH_H2_RTB=${NETIFS[p6]}

MAC_H1_RTA=$(mac_get "${VETH_H1_RTA}")
MAC_RTA_H1=$(mac_get "${VETH_RTA_H1}")
MAC_RTB_H2=$(mac_get "${VETH_RTB_H2}")
MAC_H2_RTB=$(mac_get "${VETH_H2_RTB}")

VRF_H1="vrf-h1"
VRF_RTA="vrf-rta"
VRF_RTB="vrf-rtb"
VRF_H2="vrf-h2"

# Set up a chain of 4 VRFs connected with the veth interfaces:
#   H1 <-> RTA <-> RTB <-> H2
setup_base_net()
{
	# Initialise VRFs

	vrf_prepare

	for VRF in "${VRF_H1}" "${VRF_RTA}" "${VRF_RTB}" "${VRF_H2}"; do
		vrf_create "${VRF}"
		ip link set dev "${VRF}" up
	done

	# Assign each veth to its VRF

	__simple_if_init "${VETH_H1_RTA}" "${VRF_H1}"

	__simple_if_init "${VETH_RTA_H1}" "${VRF_RTA}"
	__simple_if_init "${VETH_RTA_RTB}" "${VRF_RTA}"

	__simple_if_init "${VETH_RTB_RTA}" "${VRF_RTB}"
	__simple_if_init "${VETH_RTB_H2}" "${VRF_RTB}"

	__simple_if_init "${VETH_H2_RTB}" "${VRF_H2}"

	# Let each veth communicate with its peer

	ip address add dev "${VETH_H1_RTA}" 192.0.2.0x1a peer 192.0.2.0xa1/32
	ip address add dev "${VETH_RTA_H1}" 192.0.2.0xa1 peer 192.0.2.0x1a/32

	ip address add dev "${VETH_RTA_RTB}" 192.0.2.0xab peer 192.0.2.0xba/32
	ip address add dev "${VETH_RTB_RTA}" 192.0.2.0xba peer 192.0.2.0xab/32

	ip address add dev "${VETH_RTB_H2}" 192.0.2.0xb2 peer 192.0.2.0x2b/32
	ip address add dev "${VETH_H2_RTB}" 192.0.2.0x2b peer 192.0.2.0xb2/32

	# Define host IPs for H1 and H2 and route them through RTA and RTB.
	# Don't set up routing inside RTA and RTB yet.

	ip address add 198.51.100.1/32 dev "${VETH_H1_RTA}"
	ip address add 198.51.100.2/32 dev "${VETH_H2_RTB}"

	ip route add 198.51.100.2/32 src 198.51.100.1 via 192.0.2.0xa1	\
		vrf "${VRF_H1}"
	ip route add 198.51.100.1/32 src 198.51.100.2 via 192.0.2.0xb2	\
		vrf "${VRF_H2}"
}

# Route H1 and H2 host IPs inside RTA and RTB using VXLAN-GPE encapsulation.
setup_vxlan_gpe()
{
	# Create an external VXLAN-GPE device in the intermediate VRFs

	ip link add name tunnel-rta up type vxlan	\
		dev "${VRF_RTA}" gpe external
	ip link add name tunnel-rtb up type vxlan	\
		dev "${VRF_RTB}" gpe external

	# Forward packets received from the end hosts through the tunnels

	tc qdisc add dev "${VETH_RTA_H1}" ingress
	tc filter add dev "${VETH_RTA_H1}" ingress		\
		protocol ipv4 flower dst_ip 198.51.100.2	\
		action tunnel_key set src_ip 192.0.2.0xab	\
			dst_ip 192.0.2.0xba id 10		\
		action mirred egress redirect dev tunnel-rta

	tc qdisc add dev "${VETH_RTB_H2}" ingress
	tc filter add dev "${VETH_RTB_H2}" ingress		\
		protocol ipv4 flower dst_ip 198.51.100.1	\
		action tunnel_key set src_ip 192.0.2.0xba	\
			dst_ip 192.0.2.0xab id 10		\
		action mirred egress redirect dev tunnel-rtb

	# Decapsulate packets received from the tunnels and send them to the
	# end hosts

	tc qdisc add dev tunnel-rta ingress
	tc filter add dev tunnel-rta ingress matchall			\
		action vlan push_eth dst_mac "${MAC_H1_RTA}"		\
			src_mac "${MAC_RTA_H1}"				\
		action mirred egress redirect dev "${VETH_RTA_H1}"

	tc qdisc add dev tunnel-rtb ingress
	tc filter add dev tunnel-rtb ingress matchall			\
		action vlan push_eth dst_mac "${MAC_H2_RTB}"		\
			src_mac "${MAC_RTB_H2}"				\
		action mirred egress redirect dev "${VETH_RTB_H2}"
}

setup_base_net
setup_vxlan_gpe

ip vrf exec "${VRF_H1}" ping 198.51.100.2
David Ahern July 7, 2021, 3:05 p.m. UTC | #6
On 7/6/21 1:02 PM, Guillaume Nault wrote:
> On Thu, Jul 01, 2021 at 09:38:44AM -0600, David Ahern wrote:

>> On 7/1/21 8:59 AM, Guillaume Nault wrote:

>>> I first tried to write this selftest using VRFs, but there were some

>>> problems that made me switch to namespaces (I don't remember precisely

>>> which ones, probably virtual tunnel devices in collect_md mode).

>>

>> if you hit a problem with the test not working, send me the test script

>> and I will take a look.

> 

> So I've looked again at what it'd take to make a VRF-based selftest.

> The problem is that we currently can't create collect_md tunnel

> interfaces in different VRFs, if the VRFs are part of the same netns.

> 

> Most tunnels explicitely refuse to create a collect_md device if

> another one already exists in the netns, no matter the rest of the

> tunnel parameters. This is the behaviour of ip_gre, ipip, ip6_gre and

> ip6_tunnel.

> 

> Then there's sit, which allows the creation of the second collect_md

> device in the other VRF. However, iproute2 doesn't set the

> IFLA_IPTUN_LINK attribute when it creates an external device, so it

> can't set up such a configuration.

> 

> Bareudp simply doesn't support VRF.

> 

> Finally, vxlan allows devices with different IFLA_VXLAN_LINK attributes

> to be created, but only when VXLAN_F_IPV6_LINKLOCAL is set. Removing

> the VXLAN_F_IPV6_LINKLOCAL test at the end of vxlan_config_validate()

> is enough to make two VXLAN-GPE devices work in a multi-VRF setup:


Thanks for the details. In short, some work is needed to extend VRF
support to these tunnels. That is worth doing if you have the time.
David Ahern July 8, 2021, 1:50 a.m. UTC | #7
On 7/6/21 1:02 PM, Guillaume Nault wrote:
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -3767,8 +3767,7 @@ static int vxlan_config_validate(struct net *src_net, struct vxlan_config *conf,
>  		    (conf->flags & (VXLAN_F_RCV_FLAGS | VXLAN_F_IPV6)))
>  			continue;
>  
> -		if ((conf->flags & VXLAN_F_IPV6_LINKLOCAL) &&
> -		    tmp->cfg.remote_ifindex != conf->remote_ifindex)
> +		if (tmp->cfg.remote_ifindex != conf->remote_ifindex)
>  			continue;
>  
>  		NL_SET_ERR_MSG(extack,

Looking at the vxlan driver and that restriction is unnecessary. While
IPv6 LLA requires a device index, allowing separate LINK attributes is a
legit use case - as VRF shows.

Do you want to send a formal patch to fix this one since you have it
diagnosed?
Guillaume Nault July 9, 2021, 4:21 p.m. UTC | #8
On Wed, Jul 07, 2021 at 07:50:38PM -0600, David Ahern wrote:
> On 7/6/21 1:02 PM, Guillaume Nault wrote:
> > --- a/drivers/net/vxlan.c
> > +++ b/drivers/net/vxlan.c
> > @@ -3767,8 +3767,7 @@ static int vxlan_config_validate(struct net *src_net, struct vxlan_config *conf,
> >  		    (conf->flags & (VXLAN_F_RCV_FLAGS | VXLAN_F_IPV6)))
> >  			continue;
> >  
> > -		if ((conf->flags & VXLAN_F_IPV6_LINKLOCAL) &&
> > -		    tmp->cfg.remote_ifindex != conf->remote_ifindex)
> > +		if (tmp->cfg.remote_ifindex != conf->remote_ifindex)
> >  			continue;
> >  
> >  		NL_SET_ERR_MSG(extack,
> 
> Looking at the vxlan driver and that restriction is unnecessary. While
> IPv6 LLA requires a device index, allowing separate LINK attributes is a
> legit use case - as VRF shows.

I believe there's more to do than just my quick hack patch. At least
vxlan_vs_find_vni() probably needs to be modified too. At which point
VXLAN_F_IPV6_LINKLOCAL would become essentially unused and should be
removed.

> Do you want to send a formal patch to fix this one since you have it
> diagnosed?

I'll work on that after net-next reopens. I might try to make bareudp
compatible with VRF too, if time allows. If I get at least VXLAN and
bareudp to work in a multi-VRF setup, then I'll rebase the selftests on
the VRF infrastructure (and just remove the tests for tunnels still
incompatible with this setup).
diff mbox series

Patch

diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile
index d97bd6889446..a5005b01f14c 100644
--- a/tools/testing/selftests/net/forwarding/Makefile
+++ b/tools/testing/selftests/net/forwarding/Makefile
@@ -49,6 +49,7 @@  TEST_PROGS = bridge_igmp.sh \
 	tc_flower_router.sh \
 	tc_flower.sh \
 	tc_mpls_l2vpn.sh \
+	tc_redirect_l2l3.sh \
 	tc_shblocks.sh \
 	tc_vlan_modify.sh \
 	vxlan_asymmetric.sh \
diff --git a/tools/testing/selftests/net/forwarding/config b/tools/testing/selftests/net/forwarding/config
index a4bd1b087303..9d28f801866f 100644
--- a/tools/testing/selftests/net/forwarding/config
+++ b/tools/testing/selftests/net/forwarding/config
@@ -16,3 +16,5 @@  CONFIG_NET_ACT_GACT=m
 CONFIG_VETH=m
 CONFIG_NAMESPACES=y
 CONFIG_NET_NS=y
+CONFIG_NET_IPGRE=m
+CONFIG_NET_IPIP=m
diff --git a/tools/testing/selftests/net/forwarding/tc_redirect_l2l3.sh b/tools/testing/selftests/net/forwarding/tc_redirect_l2l3.sh
new file mode 100755
index 000000000000..3e69b5deb608
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_redirect_l2l3.sh
@@ -0,0 +1,287 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Test redirecting frames received on L3 tunnel interfaces to an Ethernet
+# interface, after having added an Ethernet header.
+#
+# Example:
+#
+#  $ tc filter add dev ipip0 ingress matchall          \
+#       action vlan push_eth dst_mac 00:00:5e:00:53:01 \
+#                            src_mac 00:00:5e:00:53:00 \
+#       action mirred egress redirect dev eth0
+#
+# Network topology is as follow: H1 and H2 are end hosts, separated by two
+# routers, RTA and RTB. They ping each other using IPv4, IPv6, IPv4 in MPLS
+# and IPv6 in MPLS packets. The L3 tunnel to test is established between RTA
+# and RTB. These routers redirect packets from the tunnel to the end host's
+# veth and the other way around.
+#
+# This scripts only needs to define how packets are forwarded between RTA and
+# RTB (as that's where we do and test the tunnel encapsulation and packet
+# redirection). The base network configuration is done in topo_nschain_lib.sh.
+
+ALL_TESTS="
+	redir_gre
+	redir_ipip
+"
+
+NUM_NETIFS=0
+
+source topo_nschain_lib.sh
+source lib.sh
+
+readonly KSFT_PASS=0
+readonly KSFT_FAIL=1
+readonly KSFT_SKIP=4
+
+KSFT_RET="${KSFT_PASS}"
+TESTS_COMPLETED="no"
+
+# Create tunnels between RTA and RTB, and forward packets between tunnel and
+# veth interfaces.
+#
+# Parameters:
+#
+#   * $1: IP version of the underlay to use ("ipv4" or "ipv6").
+#   * $2: Tunnel mode (either "classical" or "collect_md").
+#   * $3: Device type (as in "ip link add mydev type <dev_type> ...").
+#   * $4: Options for the "ip link add" command
+#         (as in "ip link add mydev type dev_type <options>").
+#   * $5: Options for the TC tunnel_key command
+#         (as in "tc filter add ... action tunnel_key set <options>").
+#
+# For classical tunnels, the "local" and "remote" options of "ip link add" are
+# set automatically and mustn't appear in $4.
+#
+# For collect_md tunnels, the "src_ip" and "dst_ip" options of
+# "action tunnel_key" are set automatically and mustn't appear in $5.
+#
+setup_tunnel()
+{
+	local UNDERLAY_PROTO="$1"; readonly UNDERLAY_PROTO
+	local TUNNEL_MODE="$2"; readonly TUNNEL_MODE
+	local DEV_TYPE="$3"; readonly DEV_TYPE
+	local DEV_OPTS="$4"; readonly DEV_OPTS
+	local TK_OPTS="$5"; readonly TK_OPTS
+	local RTA_TUNNEL_OPTS
+	local RTB_TUNNEL_OPTS
+	local RTA_TK_ACTION
+	local RTB_TK_ACTION
+	local IP_RTA
+	local IP_RTB
+
+	case "${UNDERLAY_PROTO}" in
+		"IPv4"|"ipv4")
+			IP_RTA="192.0.2.0xab"
+			IP_RTB="192.0.2.0xba"
+			;;
+		"IPv6"|"ipv6")
+			IP_RTA="2001:db8::ab"
+			IP_RTB="2001:db8::ba"
+			;;
+		*)
+			exit 1
+			;;
+	esac
+
+	case "${TUNNEL_MODE}" in
+		"classical")
+			# Classical tunnel: underlay IP addresses are part of
+			# the tunnel configuration.
+			RTA_TUNNEL_OPTS="local ${IP_RTA} remote ${IP_RTB} ${DEV_OPTS}"
+			RTB_TUNNEL_OPTS="local ${IP_RTB} remote ${IP_RTA} ${DEV_OPTS}"
+			RTA_TK_ACTION=""
+			RTB_TK_ACTION=""
+			;;
+		"collect_md")
+			# External tunnel: underlay IP addresses are attached
+			# to the packets' metadata with the tunnel_key action
+			RTA_TUNNEL_OPTS="${DEV_OPTS}"
+			RTB_TUNNEL_OPTS="${DEV_OPTS}"
+			RTA_TK_ACTION="action tunnel_key set src_ip ${IP_RTA} dst_ip ${IP_RTB} ${TK_OPTS}"
+			RTB_TK_ACTION="action tunnel_key set src_ip ${IP_RTB} dst_ip ${IP_RTA} ${TK_OPTS}"
+			;;
+		*)
+		    echo "Internal error: setup_tunnel(): invalid tunnel mode \"${TUNNEL_MODE}\""
+		    return 1
+		    ;;
+	esac
+
+	# Transform options strings to arrays, so we can pass them to the ip or
+	# tc commands with double quotes (prevents shellcheck warning).
+	read -ra RTA_TUNNEL_OPTS <<< "${RTA_TUNNEL_OPTS}"
+	read -ra RTB_TUNNEL_OPTS <<< "${RTB_TUNNEL_OPTS}"
+	read -ra RTA_TK_ACTION <<< "${RTA_TK_ACTION}"
+	read -ra RTB_TK_ACTION <<< "${RTB_TK_ACTION}"
+
+	tc -netns "${RTA}" qdisc add dev veth-rta-h1 ingress
+	tc -netns "${RTB}" qdisc add dev veth-rtb-h2 ingress
+
+	ip -netns "${RTA}" link add name tunnel-rta up type "${DEV_TYPE}" \
+		"${RTA_TUNNEL_OPTS[@]}"
+	ip -netns "${RTB}" link add name tunnel-rtb up type "${DEV_TYPE}" \
+		"${RTB_TUNNEL_OPTS[@]}"
+
+	# Encapsulate IPv4 packets
+	tc -netns "${RTA}" filter add dev veth-rta-h1 ingress	\
+		protocol ipv4 flower dst_ip 198.51.100.2	\
+		"${RTA_TK_ACTION[@]}"				\
+		action mirred egress redirect dev tunnel-rta
+	tc -netns "${RTB}" filter add dev veth-rtb-h2 ingress	\
+		protocol ipv4 flower dst_ip 198.51.100.1	\
+		"${RTB_TK_ACTION[@]}"				\
+		action mirred egress redirect dev tunnel-rtb
+
+	# Encapsulate IPv6 packets
+	tc -netns "${RTA}" filter add dev veth-rta-h1 ingress	\
+		protocol ipv6 flower dst_ip 2001:db8::1:2	\
+		"${RTA_TK_ACTION[@]}"				\
+		action mirred egress redirect dev tunnel-rta
+	tc -netns "${RTB}" filter add dev veth-rtb-h2 ingress	\
+		protocol ipv6 flower dst_ip 2001:db8::1:1	\
+		"${RTB_TK_ACTION[@]}"				\
+		action mirred egress redirect dev tunnel-rtb
+
+	# Encapsulate MPLS packets
+	tc -netns "${RTA}" filter add dev veth-rta-h1 ingress	\
+		protocol mpls_uc matchall			\
+		"${RTA_TK_ACTION[@]}"				\
+		action mirred egress redirect dev tunnel-rta
+	tc -netns "${RTB}" filter add dev veth-rtb-h2 ingress	\
+		protocol mpls_uc matchall			\
+		"${RTB_TK_ACTION[@]}"				\
+		action mirred egress redirect dev tunnel-rtb
+
+	tc -netns "${RTA}" qdisc add dev tunnel-rta ingress
+	tc -netns "${RTB}" qdisc add dev tunnel-rtb ingress
+
+	# Redirect packets from tunnel devices to end hosts
+	tc -netns "${RTA}" filter add dev tunnel-rta ingress matchall	\
+		action vlan push_eth dst_mac 00:00:5e:00:53:1a		\
+			src_mac 00:00:5e:00:53:a1			\
+		action mirred egress redirect dev veth-rta-h1
+	tc -netns "${RTB}" filter add dev tunnel-rtb ingress matchall	\
+		action vlan push_eth dst_mac 00:00:5e:00:53:2b		\
+			src_mac 00:00:5e:00:53:b2			\
+		action mirred egress redirect dev veth-rtb-h2
+}
+
+# Remove everything that was created by setup_tunnel().
+#
+cleanup_tunnel()
+{
+	ip -netns "${RTB}" link delete dev tunnel-rtb
+	ip -netns "${RTA}" link delete dev tunnel-rta
+	tc -netns "${RTB}" qdisc delete dev veth-rtb-h2 ingress
+	tc -netns "${RTA}" qdisc delete dev veth-rta-h1 ingress
+}
+
+# Ping H2 from H1.
+#
+# Parameters:
+#
+#   $1: The protocol used for the ping test:
+#         * ipv4: use plain IPv4 packets,
+#         * ipv6: use plain IPv6 packets,
+#         * ipv4-mpls: use IPv4 packets encapsulated into MPLS,
+#         * ipv6-mpls: use IPv6 packets encapsulated into MPLS.
+#   $2: Description of the test.
+#
+ping_test()
+{
+	local PROTO="$1"; readonly PROTO
+	local MSG="$2"; readonly MSG
+	local PING_CMD
+	local PING_IP
+
+	case "${PROTO}" in
+		"ipv4")
+			PING_CMD="${PING}"
+			PING_IP="198.51.100.2"
+			;;
+		"ipv6")
+			PING_CMD="${PING6}"
+			PING_IP="2001:db8::1:2"
+			;;
+		"ipv4-mpls")
+			PING_CMD="${PING}"
+			PING_IP="198.51.100.0x12"
+			;;
+		"ipv6-mpls")
+			PING_CMD="${PING6}"
+			PING_IP="2001:db8::1:12"
+			;;
+		*)
+			echo "Internal error: ping_test(): invalid protocol \"${PROTO}\""
+			return 1
+			;;
+	esac
+
+	set +e
+	RET=0
+	ip netns exec "${H1}" "${PING_CMD}" -w "${PING_TIMEOUT}" -c 1 "${PING_IP}" > /dev/null 2>&1
+	RET=$?
+	log_test "${MSG}" || KSFT_RET="${KSFT_FAIL}"
+	set -e
+}
+
+redir_gre()
+{
+	setup_tunnel "ipv4" "classical" "gre"
+	ping_test ipv4 "GRE, classical mode: IPv4 / GRE / IPv4"
+	ping_test ipv6 "GRE, classical mode: IPv4 / GRE / IPv6"
+	ping_test ipv4-mpls "GRE, classical mode: IPv4 / GRE / MPLS / IPv4"
+	ping_test ipv6-mpls "GRE, classical mode: IPv4 / GRE / MPLS / IPv6"
+	cleanup_tunnel
+
+	setup_tunnel "ipv4" "collect_md" "gre" "external" "nocsum"
+	ping_test ipv4 "GRE, external mode: IPv4 / GRE / IPv4"
+	ping_test ipv6 "GRE, external mode: IPv4 / GRE / IPv6"
+	ping_test ipv4-mpls "GRE, external mode: IPv4 / GRE / MPLS / IPv4"
+	ping_test ipv6-mpls "GRE, external mode: IPv4 / GRE / MPLS / IPv6"
+	cleanup_tunnel
+}
+
+redir_ipip()
+{
+	setup_tunnel "ipv4" "classical" "ipip" "mode any"
+	ping_test ipv4 "IPIP, classical mode: IPv4 / IPv4"
+	ping_test ipv4-mpls "IPIP, classical mode: IPv4 / MPLS / IPv4"
+	ping_test ipv6-mpls "IPIP, classical mode: IPv4 / MPLS / IPv6"
+	cleanup_tunnel
+
+	setup_tunnel "ipv4" "collect_md" "ipip" "mode any external"
+	ping_test ipv4 "IPIP, external mode: IPv4 / IPv4"
+	ping_test ipv4-mpls "IPIP, external mode: IPv4 / MPLS / IPv4"
+	ping_test ipv6-mpls "IPIP, external mode: IPv4 / MPLS / IPv6"
+	cleanup_tunnel
+}
+
+exit_cleanup()
+{
+	if [ "${TESTS_COMPLETED}" = "no" ]; then
+		KSFT_RET="${KSFT_FAIL}"
+	fi
+
+	pre_cleanup
+	nsc_cleanup_ns "${H1}" "${RTA}" "${RTB}" "${H2}"
+	exit "${KSFT_RET}"
+}
+
+
+if ! tc actions add action vlan help 2>&1 | grep --quiet 'push_eth'; then
+	echo "SKIP: iproute2 is too old: tc doesn't support action \"push_eth\""
+	exit "${KSFT_SKIP}"
+fi
+
+nsc_setup_ns || exit "${KSFT_FAIL}"
+
+set -e
+trap exit_cleanup EXIT
+
+nsc_setup_base_net
+nsc_setup_hosts_net
+
+tests_run
+TESTS_COMPLETED="yes"
diff --git a/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh b/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh
new file mode 100644
index 000000000000..4c0bf2d7328a
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/topo_nschain_lib.sh
@@ -0,0 +1,267 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# A chain of 4 nodes connected with veth pairs.
+# Each node lives in its own network namespace.
+# Each veth interface has an IPv4 and an IPv6 address. A host route provides
+# connectivity to the adjacent node. This base network only allows nodes to
+# communicate with their immediate neighbours.
+#
+# The two nodes at the extremities of the chain also have 4 host IPs on their
+# loopback device:
+#   * An IPv4 address, routed as is to the adjacent router.
+#   * An IPv4 address, routed over MPLS to the adjacent router.
+#   * An IPv6 address, routed as is to the adjacent router.
+#   * An IPv6 address, routed over MPLS to the adjacent router.
+#
+# This topology doesn't define how RTA and RTB handle these packets: users of
+# this script are responsible for the plumbing between RTA and RTB.
+#
+# As each veth connects two different namespaces, their MAC and IP addresses
+# are defined depending on the local and remote namespaces. For example
+# veth-h1-rta, which sits in H1 and links to RTA, has MAC address
+# 00:00:5e:00:53:1a, IPv4 192.0.2.0x1a and IPv6 2001:db8::1a, where "1a" means
+# that it's in H1 and links to RTA (the rest of each address is always built
+# from a IANA documentation prefix).
+#
+# Routed addresses in H1 and H2 on the other hand encode the routing type (with
+# or without MPLS encapsulation) and the namespace the address resides in. For
+# example H2 has 198.51.100.2 and 2001:db8::1:2, that are routed as is through
+# RTB. It also has 198.51.100.0x12 and 2001:db8::1:12, that are routed through
+# RTB with MPLS encapsulation.
+#
+# For clarity, the prefixes used for host IPs are different from the ones used
+# on the veths.
+#
+# The MPLS labels follow a similar principle: the first digit represents the
+# IP version of the encapsulated packet ("4" for IPv4, "6" for IPv6), the
+# second digit represents the destination host ("1" for H1, "2" for H2).
+#
+# +----------------------------------------------------+
+# |                    Host 1 (H1)                     |
+# |                                                    |
+# |   lo                                               |
+# |     198.51.100.1    (for plain IPv4)               |
+# |     2001:db8::1:1   (for plain IPv6)               |
+# |     198.51.100.0x11 (for IPv4 over MPLS, label 42) |
+# |     2001:db8::1:11  (for IPv6 over MPLS, label 62) |
+# |                                                    |
+# | + veth-h1-rta                                      |
+# | |   192.0.2.0x1a                                   |
+# | |   2001:db8::1a                                   |
+# +-|--------------------------------------------------+
+#   |
+# +-|--------------------+
+# | |  Router A (RTA)    |
+# | |                    |
+# | + veth-rta-h1        |
+# |     192.0.2.0xa1     |
+# |     2001:db8::a1     |
+# |                      |
+# | + veth-rta-rtb       |
+# | |   192.0.2.0xab     |
+# | |   2001:db8::ab     |
+# +-|--------------------+
+#   |
+# +-|--------------------+
+# | |  Router B (RTB)    |
+# | |                    |
+# | + veth-rtb-rta       |
+# |     192.0.2.0xba     |
+# |     2001:db8::ba     |
+# |                      |
+# | + veth-rtb-h2        |
+# | |   192.0.2.0xb2     |
+# | |   2001:db8::b2     |
+# +-|--------------------+
+#   |
+# +-|--------------------------------------------------+
+# | |                  Host 2 (H2)                     |
+# | |                                                  |
+# | + veth-h2-rtb                                      |
+# |     192.0.2.0x2b                                   |
+# |     2001:db8::2b                                   |
+# |                                                    |
+# |   lo                                               |
+# |     198.51.100.2    (for plain IPv4)               |
+# |     2001:db8::1:2   (for plain IPv6)               |
+# |     198.51.100.0x12 (for IPv4 over MPLS, label 41) |
+# |     2001:db8::1:12  (for IPv6 over MPLS, label 61) |
+# +----------------------------------------------------+
+#
+# This topology can be used for testing different routing or switching
+# scenarios, as H1 and H2 are pre-configured for sending different kinds of
+# packets (IPv4, IPv6, with or without MPLS encapsulation), which RTA and RTB
+# can easily match and process according to the forwarding mechanism to test.
+
+readonly H1=$(mktemp -u h1-XXXXXXXX)
+readonly RTA=$(mktemp -u rta-XXXXXXXX)
+readonly RTB=$(mktemp -u rtb-XXXXXXXX)
+readonly H2=$(mktemp -u h2-XXXXXXXX)
+
+# Create and configure a veth pair between two network namespaces A and B
+#
+# Parameters:
+#
+#   * $1: Name of netns A.
+#   * $2: Name of netns B.
+#   * $3: Name of the veth device to create in netns A.
+#   * $4: Name of the veth device to create in netns B.
+#   * $5: Identifier used to configure IP and MAC addresses in netns A.
+#   * $6: Identifier used to configure IP and MAC addresses in netns B.
+#
+# The identifiers are a one byte long integer given in hexadecimal format
+# (without a "0x" prefix). They're used as the lowest order byte for the MAC,
+# IPv4 and IPv6 addresses.
+#
+nsc_veth_config()
+{
+	local NS_A="${1}"; readonly NS_A
+	local NS_B="${2}"; readonly NS_B
+	local DEV_A="${3}"; readonly DEV_A
+	local DEV_B="${4}"; readonly DEV_B
+	local ID_A="${5}"; readonly ID_A
+	local ID_B="${6}"; readonly ID_B
+
+	ip link add name "${DEV_A}" address 00:00:5e:00:53:"${ID_A}"	\
+		netns "${NS_A}" type veth peer name "${DEV_B}"		\
+		address 00:00:5e:00:53:"${ID_B}" netns "${NS_B}"
+	ip -netns "${NS_A}" link set dev "${DEV_A}" up
+	ip -netns "${NS_B}" link set dev "${DEV_B}" up
+
+	ip -netns "${NS_A}" address add dev "${DEV_A}"		\
+		192.0.2.0x"${ID_A}" peer 192.0.2.0x"${ID_B}"/32
+	ip -netns "${NS_B}" address add dev "${DEV_B}"		\
+		192.0.2.0x"${ID_B}" peer 192.0.2.0x"${ID_A}"/32
+
+	ip -netns "${NS_A}" address add dev "${DEV_A}"			\
+		2001:db8::"${ID_A}" peer 2001:db8::"${ID_B}" nodad
+	ip -netns "${NS_B}" address add dev "${DEV_B}"			\
+		2001:db8::"${ID_B}" peer 2001:db8::"${ID_A}" nodad
+}
+
+# Add host IP addresses to the loopback device and route them to the adjacent
+# router.
+#
+# Parameters:
+#
+#   $1: Name of the netns to configure.
+#   $2: Identifier used to configure the local IP address.
+#   $3: Identifier used to configure the remote IP address.
+#   $4: IPv4 address of the adjacent router.
+#   $5: IPv6 address of the adjacent router.
+#   $6: Name of the network interface that links to the adjacent router.
+#
+# The identifiers are a one byte long integer given in hexadecimal format
+# (without a "0x" prefix). They're used as the lowest order byte for the IPv4
+# and IPv6 addresses.
+#
+nsc_lo_config()
+{
+	local NS="${1}"; readonly NS
+	local LOCAL_NSID="${2}"; readonly LOCAL_NSID
+	local PEER_NSID="${3}"; readonly PEER_NSID
+	local GW_IP4="${4}"; readonly GW_IP4
+	local GW_IP6="${5}"; readonly GW_IP6
+	local IFACE="${6}"; readonly IFACE
+
+	# For testing plain IPv4 traffic
+	ip -netns "${NS}" address add 198.51.100.0x"${LOCAL_NSID}"/32 dev lo
+	ip -netns "${NS}" route add 198.51.100.0x"${PEER_NSID}"/32	\
+		src 198.51.100.0x"${LOCAL_NSID}" via "${GW_IP4}"
+
+	# For testing plain IPv6 traffic
+	ip -netns "${NS}" address add 2001:db8::1:"${LOCAL_NSID}"/128 dev lo
+	ip -netns "${NS}" route add 2001:db8::1:"${PEER_NSID}"/128	\
+		src 2001:db8::1:"${LOCAL_NSID}" via "${GW_IP6}"
+
+	# For testing IPv4 over MPLS traffic
+	ip -netns "${NS}" address add 198.51.100.0x1"${LOCAL_NSID}"/32 dev lo
+	ip -netns "${NS}" route add 198.51.100.0x1"${PEER_NSID}"/32	\
+		src 198.51.100.0x1"${LOCAL_NSID}"			\
+		encap mpls 4"${PEER_NSID}" via "${GW_IP4}"
+
+	# For testing IPv6 over MPLS traffic
+	ip -netns "${NS}" address add 2001:db8::1:1"${LOCAL_NSID}"/128 dev lo
+	ip -netns "${NS}" route add 2001:db8::1:1"${PEER_NSID}"/128	\
+		src 2001:db8::1:1"${LOCAL_NSID}"			\
+		encap mpls 6"${PEER_NSID}" via "${GW_IP6}"
+
+	# Allow MPLS traffic
+	ip netns exec "${NS}" sysctl -qw net.mpls.platform_labels=100
+	ip netns exec "${NS}" sysctl -qw net.mpls.conf."${IFACE}".input=1
+
+	# Deliver MPLS packets locally
+	ip -netns "${NS}" -family mpls route add 4"${LOCAL_NSID}" dev lo
+	ip -netns "${NS}" -family mpls route add 6"${LOCAL_NSID}" dev lo
+}
+
+# Remove the network namespaces
+#
+# Parameters:
+#
+#   * The list of network namespaces to delete.
+#
+nsc_cleanup_ns()
+{
+	for ns in "$@"; do
+		ip netns delete "${ns}" 2>/dev/null || true
+	done
+}
+
+# Remove the network namespaces and return error
+#
+# Parameters:
+#
+#   * The list of network namespaces to delete.
+#
+nsc_err_cleanup_ns()
+{
+	nsc_cleanup_ns "$@"
+	return 1
+}
+
+# Create the four network namespaces (H1, RTA, RTB and H2)
+#
+nsc_setup_ns()
+{
+	ip netns add "${H1}" || nsc_err_cleanup_ns
+	ip netns add "${RTA}" || nsc_err_cleanup_ns "${H1}"
+	ip netns add "${RTB}" || nsc_err_cleanup_ns "${H1}" "${RTA}"
+	ip netns add "${H2}" || nsc_err_cleanup_ns "${H1}" "${RTA}" "${RTB}"
+}
+
+# Create base networking topology:
+#
+#   * Set up the loopback device in all network namespaces.
+#   * Create a veth pair to connect each netns in sequence.
+#   * Add an IPv4 and an IPv6 address on each veth interface.
+#
+# Requires the network namespaces to already exist (see nsc_setup_ns()).
+#
+nsc_setup_base_net()
+{
+	for ns in "${H1}" "${RTA}" "${RTB}" "${H2}"; do
+		ip -netns "${ns}" link set dev lo up
+	done;
+
+	nsc_veth_config "${H1}" "${RTA}" veth-h1-rta veth-rta-h1 1a a1
+	nsc_veth_config "${RTA}" "${RTB}" veth-rta-rtb veth-rtb-rta ab ba
+	nsc_veth_config "${RTB}" "${H2}" veth-rtb-h2 veth-h2-rtb b2 2b
+}
+
+# Configure the host IP addresses and routes in H1 and H2:
+#
+#   * Define the four host IP addresses on the loopback device of H1 and H2.
+#   * Route these addresses in H1 and H2 through the adjacent router (with MPLS
+#     encapsulation for two of them).
+#   * No routing is defined between RTA and RTB, that's the responsibility of
+#     the calling script.
+#
+# Requires the base network to be configured (see nsc_setup_base_net()).
+#
+nsc_setup_hosts_net()
+{
+	nsc_lo_config "${H1}" 1 2 192.0.2.0xa1 2001:db8::a1 veth-h1-rta
+	nsc_lo_config "${H2}" 2 1 192.0.2.0xb2 2001:db8::b2 veth-h2-rtb
+}