Message ID | 20210815120002.2787653-1-lschlesinger@drivenets.com |
---|---|
State | New |
Headers | show |
Series | vrf: Reset skb conntrack connection on VRF rcv | expand |
On 8/15/21 6:00 AM, Lahav Schlesinger wrote: > To fix the "reverse-NAT" for replies. > > When a packet is sent over a VRF, the POST_ROUTING hooks are called > twice: Once from the VRF interface, and once from the "actual" > interface the packet will be sent from: > 1) First SNAT: l3mdev_l3_out() -> vrf_l3_out() -> .. -> vrf_output_direct() > This causes the POST_ROUTING hooks to run. > 2) Second SNAT: 'ip_output()' calls POST_ROUTING hooks again. > > Similarly for replies, first ip_rcv() calls PRE_ROUTING hooks, and > second vrf_l3_rcv() calls them again. > > As an example, consider the following SNAT rule: >> iptables -t nat -A POSTROUTING -p udp -m udp --dport 53 -j SNAT --to-source 2.2.2.2 -o vrf_1 > > In this case sending over a VRF will create 2 conntrack entries. > The first is from the VRF interface, which performs the IP SNAT. > The second will run the SNAT, but since the "expected reply" will remain > the same, conntrack randomizes the source port of the packet: > e..g With a socket bound to 1.1.1.1:10000, sending to 3.3.3.3:53, the conntrack > rules are: > udp 17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=0 bytes=0 mark=0 use=1 > udp 17 29 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 packets=0 bytes=0 mark=0 use=1 > > i.e. First SNAT IP from 1.1.1.1 --> 2.2.2.2, and second the src port is > SNAT-ed from 10000 --> 61033. > > But when a reply is sent (3.3.3.3:53 -> 2.2.2.2:61033) only the later > conntrack entry is matched: > udp 17 29 src=2.2.2.2 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 src=3.3.3.3 dst=2.2.2.2 sport=53 dport=61033 packets=1 bytes=49 mark=0 use=1 > udp 17 28 src=1.1.1.1 dst=3.3.3.3 sport=10000 dport=53 packets=1 bytes=68 [UNREPLIED] src=3.3.3.3 dst=2.2.2.2 sport=53 dport=10000 packets=0 bytes=0 mark=0 use=1 > > And a "port 61033 unreachable" ICMP packet is sent back. > > The issue is that when PRE_ROUTING hooks are called from vrf_l3_rcv(), > the skb already has a conntrack flow attached to it, which means > nf_conntrack_in() will not resolve the flow again. > > This means only the dest port is "reverse-NATed" (61033 -> 10000) but > the dest IP remains 2.2.2.2, and since the socket is bound to 1.1.1.1 it's > not received. > This can be verified by logging the 4-tuple of the packet in '__udp4_lib_rcv()'. > > The fix is then to reset the flow when skb is received on a VRF, to let > conntrack resolve the flow again (which now will hit the earlier flow). > > To reproduce: (Without the fix "Got pkt_to_nat_port" will not be printed by > running 'bash ./repro'): > $ cat run_in_A1.py > import logging > logging.getLogger("scapy.runtime").setLevel(logging.ERROR) > from scapy.all import * > import argparse > > def get_packet_to_send(udp_dst_port, msg_name): > return Ether(src='11:22:33:44:55:66', dst=iface_mac)/ \ > IP(src='3.3.3.3', dst='2.2.2.2')/ \ > UDP(sport=53, dport=udp_dst_port)/ \ > Raw(f'{msg_name}\x0012345678901234567890') > > parser = argparse.ArgumentParser() > parser.add_argument('-iface_mac', dest="iface_mac", type=str, required=True, > help="From run_in_A3.py") > parser.add_argument('-socket_port', dest="socket_port", type=str, > required=True, help="From run_in_A3.py") > parser.add_argument('-v1_mac', dest="v1_mac", type=str, required=True, > help="From script") > > args, _ = parser.parse_known_args() > iface_mac = args.iface_mac > socket_port = int(args.socket_port) > v1_mac = args.v1_mac > > print(f'Source port before NAT: {socket_port}') > > while True: > pkts = sniff(iface='_v0', store=True, count=1, timeout=10) > if 0 == len(pkts): > print('Something failed, rerun the script :(', flush=True) > break > pkt = pkts[0] > if not pkt.haslayer('UDP'): > continue > > pkt_sport = pkt.getlayer('UDP').sport > print(f'Source port after NAT: {pkt_sport}', flush=True) > > pkt_to_send = get_packet_to_send(pkt_sport, 'pkt_to_nat_port') > sendp(pkt_to_send, '_v0', verbose=False) # Will not be received > > pkt_to_send = get_packet_to_send(socket_port, 'pkt_to_socket_port') > sendp(pkt_to_send, '_v0', verbose=False) > break > > $ cat run_in_A2.py > import socket > import netifaces > > print(f"{netifaces.ifaddresses('e00000')[netifaces.AF_LINK][0]['addr']}", > flush=True) > s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM) > s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, > str('vrf_1' + '\0').encode('utf-8')) > s.connect(('3.3.3.3', 53)) > print(f'{s. getsockname()[1]}', flush=True) > s.settimeout(5) > > while True: > try: > # Periodically send in order to keep the conntrack entry alive. > s.send(b'a'*40) > resp = s.recvfrom(1024) > msg_name = resp[0].decode('utf-8').split('\0')[0] > print(f"Got {msg_name}", flush=True) > except Exception as e: > pass > > $ cat repro.sh > ip netns del A1 2> /dev/null > ip netns del A2 2> /dev/null > ip netns add A1 > ip netns add A2 > > ip -n A1 link add _v0 type veth peer name _v1 netns A2 > ip -n A1 link set _v0 up > > ip -n A2 link add e00000 type bond > ip -n A2 link add lo0 type dummy > ip -n A2 link add vrf_1 type vrf table 10001 > ip -n A2 link set vrf_1 up > ip -n A2 link set e00000 master vrf_1 > > ip -n A2 addr add 1.1.1.1/24 dev e00000 > ip -n A2 link set e00000 up > ip -n A2 link set _v1 master e00000 > ip -n A2 link set _v1 up > ip -n A2 link set lo0 up > ip -n A2 addr add 2.2.2.2/32 dev lo0 > > ip -n A2 neigh add 1.1.1.10 lladdr 77:77:77:77:77:77 dev e00000 > ip -n A2 route add 3.3.3.3/32 via 1.1.1.10 dev e00000 table 10001 > > ip netns exec A2 iptables -t nat -A POSTROUTING -p udp -m udp --dport 53 -j \ > SNAT --to-source 2.2.2.2 -o vrf_1 > > sleep 5 > ip netns exec A2 python3 run_in_A2.py > x & > XPID=$! > sleep 5 > > IFACE_MAC=`sed -n 1p x` > SOCKET_PORT=`sed -n 2p x` > V1_MAC=`ip -n A2 link show _v1 | sed -n 2p | awk '{print $2'}` > ip netns exec A1 python3 run_in_A1.py -iface_mac ${IFACE_MAC} -socket_port \ > ${SOCKET_PORT} -v1_mac ${SOCKET_PORT} > sleep 5 > > kill -9 $XPID > wait $XPID 2> /dev/null > ip netns del A1 > ip netns del A2 > tail x -n 2 > rm x > set +x > > Signed-off-by: Lahav Schlesinger <lschlesinger@drivenets.com> > --- > drivers/net/vrf.c | 4 ++++ > 1 file changed, 4 insertions(+) > Thanks for the detailed explanation and use case. Looks correct to me. Reviewed-by: David Ahern <dsahern@kernel.org>
On Mon, 16 Aug 2021 12:08:00 -0600 David Ahern wrote: > On 8/15/21 6:00 AM, Lahav Schlesinger wrote: > > To fix the "reverse-NAT" for replies. > > Thanks for the detailed explanation and use case. > > Looks correct to me. > Reviewed-by: David Ahern <dsahern@kernel.org> I get a sense this is a fix. Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device") ? Or maybe it fixes commit a0f37efa8225 ("net: vrf: Fix NAT within a VRF")?
On 8/16/21 3:41 PM, Jakub Kicinski wrote: > On Mon, 16 Aug 2021 12:08:00 -0600 David Ahern wrote: >> On 8/15/21 6:00 AM, Lahav Schlesinger wrote: >>> To fix the "reverse-NAT" for replies. >> >> Thanks for the detailed explanation and use case. >> >> Looks correct to me. >> Reviewed-by: David Ahern <dsahern@kernel.org> > > I get a sense this is a fix. It is. I missed the lack of Fixes tag. > > Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device") > ? This one. Thanks, > > Or maybe it fixes commit a0f37efa8225 ("net: vrf: Fix NAT within a > VRF")? >
On Mon, 16 Aug 2021 17:28:49 -0600 David Ahern wrote: > >> Looks correct to me. > >> Reviewed-by: David Ahern <dsahern@kernel.org> > > Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device") > > ? > > This one. > > > Or maybe it fixes commit a0f37efa8225 ("net: vrf: Fix NAT within a > > VRF")? Applied, thanks!
diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index 2b1b944d4b28..8bbe2a7bb141 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -1367,6 +1367,8 @@ static struct sk_buff *vrf_ip6_rcv(struct net_device *vrf_dev, bool need_strict = rt6_need_strict(&ipv6_hdr(skb)->daddr); bool is_ndisc = ipv6_ndisc_frame(skb); + nf_reset_ct(skb); + /* loopback, multicast & non-ND link-local traffic; do not push through * packet taps again. Reset pkt_type for upper layers to process skb. * For strict packets with a source LLA, determine the dst using the @@ -1429,6 +1431,8 @@ static struct sk_buff *vrf_ip_rcv(struct net_device *vrf_dev, skb->skb_iif = vrf_dev->ifindex; IPCB(skb)->flags |= IPSKB_L3SLAVE; + nf_reset_ct(skb); + if (ipv4_is_multicast(ip_hdr(skb)->daddr)) goto out;