Message ID | 20201027105117.23052-5-tobias@waldekranz.com
State      | New
Series     | net: dsa: link aggregation support
On Tue, Oct 27, 2020 at 11:51:17AM +0100, Tobias Waldekranz wrote:
> Packets ingressing on a LAG that egress on the CPU port, which are not
> classified as management, will have a FORWARD tag that does not
> contain the normal source device/port tuple. Instead the trunk bit
> will be set, and the port field holds the LAG id.
>
> Since the exact source port information is not available in the tag,
> frames are injected directly on the LAG interface and thus never
> pass through any DSA port interface on ingress.
>
> Management frames (TO_CPU) are not affected and will pass through the
> DSA port interface as usual.
>
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---
>  net/dsa/dsa.c      | 23 +++++++++++++----------
>  net/dsa/tag_edsa.c | 12 +++++++++++-
>  2 files changed, 24 insertions(+), 11 deletions(-)
>
> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index 2131bf2b3a67..b84e5f0be049 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -220,7 +220,6 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
>  	}
>
>  	skb = nskb;
> -	p = netdev_priv(skb->dev);
>  	skb_push(skb, ETH_HLEN);
>  	skb->pkt_type = PACKET_HOST;
>  	skb->protocol = eth_type_trans(skb, skb->dev);
> @@ -234,17 +233,21 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
>  		skb = nskb;
>  	}
>
> -	s = this_cpu_ptr(p->stats64);
> -	u64_stats_update_begin(&s->syncp);
> -	s->rx_packets++;
> -	s->rx_bytes += skb->len;
> -	u64_stats_update_end(&s->syncp);
> +	if (dsa_slave_dev_check(skb->dev)) {
> +		p = netdev_priv(skb->dev);
> +		s = this_cpu_ptr(p->stats64);
> +		u64_stats_update_begin(&s->syncp);
> +		s->rx_packets++;
> +		s->rx_bytes += skb->len;
> +		u64_stats_update_end(&s->syncp);
>
> -	if (dsa_skb_defer_rx_timestamp(p, skb))
> -		return 0;
> -
> -	gro_cells_receive(&p->gcells, skb);
> +		if (dsa_skb_defer_rx_timestamp(p, skb))
> +			return 0;
>
> +		gro_cells_receive(&p->gcells, skb);
> +	} else {
> +		netif_rx(skb);
> +	}
>  	return 0;
>  }

On one hand, I feel pretty yucky about this change.
On the other hand, I wonder if it might be useful under some conditions
for drivers with DSA_TAG_PROTO_NONE? For example, once the user bridges
all slave interfaces, then that bridge will start being able to send and
receive traffic, despite none of the individual switch ports being able
to do that. Then, you could even go off and bridge a "foreign" interface,
and that would again work properly. That use case is not supported today,
but is very useful.

Thoughts?
On Wed Oct 28, 2020 at 3:05 PM CET, Vladimir Oltean wrote:
> On one hand, I feel pretty yucky about this change.
> On the other hand, I wonder if it might be useful under some conditions
> for drivers with DSA_TAG_PROTO_NONE? For example, once the user bridges
> all slave interfaces, then that bridge will start being able to send and
> receive traffic, despite none of the individual switch ports being able
> to do that. Then, you could even go off and bridge a "foreign" interface,
> and that would again work properly. That use case is not supported today,
> but is very useful.
>
> Thoughts?

In a scenario like the one you describe, are you saying that you would
set skb->dev to the bridge's netdev in the rcv op?

On ingress I think that would work. On egress I guess you would be
getting duplicates for all multi-destination packets, as there is no
way for the none-tagger to limit it, right? (Unless you have the
awesome tx-offloading we talked about yesterday, of course :))

I think bridging to "foreign" interfaces still won't work, because on
ingress the packet would never be caught by the bridge's rx handler. In
order for something to be received by br_input.c, it has to pass
through an interface that is attached to it, no? Everything above the
bridge (like VLAN interfaces) should still work, though.
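For reference, a condensed sketch of the mechanism Tobias is pointing at
(the two call sites below are paraphrased from net/bridge/br_if.c and
net/core/dev.c of this era, trimmed to the relevant lines; they are not
complete kernel code):

/* br_add_if(), condensed: enslaving a device to the bridge registers
 * br_handle_frame() as that device's rx_handler. */
err = netdev_rx_handler_register(dev, br_handle_frame, p);

/* __netif_receive_skb_core(), condensed: the handler is looked up via
 * skb->dev, so only frames whose skb->dev is an enslaved port ever
 * enter br_input.c. */
rx_handler = rcu_dereference(skb->dev->rx_handler);
if (rx_handler)
	ret = rx_handler(&skb);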
Adding Ido here, he has some more experience with the do's and don'ts
here, and therefore might have one or two ideas to share.

On Wed, Oct 28, 2020 at 04:28:23PM +0100, Tobias Waldekranz wrote:
> On Wed Oct 28, 2020 at 3:05 PM CET, Vladimir Oltean wrote:
> > On one hand, I feel pretty yucky about this change.
> > On the other hand, I wonder if it might be useful under some conditions
> > for drivers with DSA_TAG_PROTO_NONE? For example, once the user bridges
> > all slave interfaces, then that bridge will start being able to send and
> > receive traffic, despite none of the individual switch ports being able
> > to do that. Then, you could even go off and bridge a "foreign" interface,
> > and that would again work properly. That use case is not supported today,
> > but is very useful.
> >
> > Thoughts?
>
> In a scenario like the one you describe, are you saying that you would
> set skb->dev to the bridge's netdev in the rcv op?
>
> On ingress I think that would work. On egress I guess you would be
> getting duplicates for all multi-destination packets, as there is no
> way for the none-tagger to limit it, right? (Unless you have the
> awesome tx-offloading we talked about yesterday, of course :))
>
> I think bridging to "foreign" interfaces still won't work, because on
> ingress the packet would never be caught by the bridge's rx handler. In
> order for something to be received by br_input.c, it has to pass
> through an interface that is attached to it, no? Everything above the
> bridge (like VLAN interfaces) should still work, though.

Yes, I expect that the bridge input would need to have one more entry
path into it than just br_handle_frame.

I'm a bit confused and undecided right now, so let's look at it from a
different perspective. Let's imagine a switchdev driver (DSA or not)
which is able to offload IP forwarding. There are some interfaces that
are bridged and one that is standalone. The setup looks as below.

IP interfaces

+---------------------------------------------------------+
|                           br0                           |
+---------------------------------------------------------+

+------------+ +------------+ +------------+ +------------+ +------------+
|    swp0    | |    swp1    | |    swp2    | |    swp3    | |    eth0    |
+------------+ +------------+ +------------+ +------------+ +------------+

Hardware interfaces

+------------+ +------------+ +------------+ +------------+ +------------+
| DSA port 0 | | DSA port 1 | | DSA port 2 | | DSA port 3 | |   e1000    |
+------------+ +------------+ +------------+ +------------+ +------------+

Let's say you receive a packet on the standalone swp0, and you need to
perform IP routing towards the bridged domain br0. Some switchdev/DSA
ports are bridged and some aren't.

The switchdev/DSA switch will attempt to do the IP routing step first,
and it _can_ do that because it is aware of the br0 interface, so it
will decrement the TTL and replace the L2 header.

At this stage we have a modified IP packet, which corresponds with what
should be injected into the hardware's view of the br0 interface. The
packet is still in the switchdev/DSA hardware data path.

But then, the switchdev/DSA hardware will look up the FDB in the name of
br0, in an attempt of finding the destination port for the packet. But
the packet should be delivered to a station connected to eth0 (e1000,
foreign interface). So that's part of the exception path, the packet
should be delivered to the CPU. But the packet was already modified by
the hardware data path (IP forwarding has already taken place)!
So how should the DSA/switchdev hardware deliver the packet to the CPU?
It has 2 options:

(a) unwind the entire packet modification, cancel the IP forwarding and
    deliver the unmodified packet to the CPU on behalf of swp0, the
    ingress port. Then let software IP forwarding plus software bridging
    deal with it, so that it can reach the e1000.

(b) deliver the packet to the CPU in the middle of the hardware
    forwarding data path, where the exception/miss occurred, aka deliver
    it on behalf of br0. Modified by IP forwarding. This is where we'd
    have to manually inject skb->dev into br0 somehow.

Maybe this sounds a bit crazy, considering that we don't have IP
forwarding hardware with DSA today, and I am not exactly sure how other
switchdev drivers deal with this exception path today. But nonetheless,
it's almost impossible for DSA switches with IP forwarding abilities to
never come up some day, so we ought to have our mind set about how the
RX data path should look like, and whether injecting directly into an
upper is icky or a fact of life.

Things get even more interesting when this is a cascaded DSA setup, and
the bridging and routing are cross-chip. There, with the FIB/FDB split
across 2 switches, there isn't really any working around the problem
that the packet might need to be delivered to the CPU somewhere in the
middle of the data path, and it would need to be injected into the RX
path of an upper interface in that case.

What do you think?
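To make option (b) concrete: a hypothetical sketch (not code from this
series) of what "delivering on behalf of br0" could look like in a
driver's rcv path, assuming the ingress port's bridge can be resolved
with the standard upper-device helpers:

/* Hypothetical sketch of option (b). The hardware already performed
 * the routing step in br0's context, so present the skb to the stack
 * as if br0 itself had received it. Note that this bypasses
 * br_handle_frame(), so no bridge RX processing or learning happens
 * for the frame. */
struct net_device *upper;

upper = netdev_master_upper_dev_get_rcu(slave_dev);
if (upper && netif_is_bridge_master(upper)) {
	skb->dev = upper;
	netif_rx(skb);
}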
On Wed Oct 28, 2020 at 9:18 PM CET, Vladimir Oltean wrote:
> Let's say you receive a packet on the standalone swp0, and you need to
> perform IP routing towards the bridged domain br0. Some switchdev/DSA
> ports are bridged and some aren't.
>
> The switchdev/DSA switch will attempt to do the IP routing step first,
> and it _can_ do that because it is aware of the br0 interface, so it
> will decrement the TTL and replace the L2 header.
>
> At this stage we have a modified IP packet, which corresponds with what
> should be injected into the hardware's view of the br0 interface. The
> packet is still in the switchdev/DSA hardware data path.
>
> But then, the switchdev/DSA hardware will look up the FDB in the name of
> br0, in an attempt of finding the destination port for the packet. But
> the packet should be delivered to a station connected to eth0 (e1000,
> foreign interface). So that's part of the exception path, the packet
> should be delivered to the CPU.
>
> But the packet was already modified by the hardware data path (IP
> forwarding has already taken place)! So how should the DSA/switchdev
> hardware deliver the packet to the CPU? It has 2 options:
>
> (a) unwind the entire packet modification, cancel the IP forwarding and
>     deliver the unmodified packet to the CPU on behalf of swp0, the
>     ingress port. Then let software IP forwarding plus software bridging
>     deal with it, so that it can reach the e1000.
> (b) deliver the packet to the CPU in the middle of the hardware
>     forwarding data path, where the exception/miss occurred, aka deliver
>     it on behalf of br0. Modified by IP forwarding. This is where we'd
>     have to manually inject skb->dev into br0 somehow.

The thing is, unlike L2 where the hardware will add new neighbors to
its FDB autonomously, every entry in the hardware FIB is under the
strict control of the CPU. So I think you can avoid much of this
headache simply by determining if a given L3 nexthop/neighbor is
"foreign" to the switch or not, and then just skip offloading for
those entries.

You miss out on the hardware acceleration of replacing the L2 header
of course. But my guess would be that once you have paid the tax of
receiving the buffer via the NIC driver, allocated an skb, and called
netif_rx() etc., the routing operation will be a rounding error. At
least on smaller devices where the FIB is typically quite small.

> Maybe this sounds a bit crazy, considering that we don't have IP
> forwarding hardware with DSA today, and I am not exactly sure how other
> switchdev drivers deal with this exception path today. But nonetheless,
> it's almost impossible for DSA switches with IP forwarding abilities to
> never come up some day, so we ought to have our mind set about how the
> RX data path should look like, and whether injecting directly into an
> upper is icky or a fact of life.

Not crazy at all. In fact the Amethyst (6393X), for which there is a
patchset available on netdev, is capable of doing this (the hardware
is - the posted patches do not implement it).

> Things get even more interesting when this is a cascaded DSA setup, and
> the bridging and routing are cross-chip. There, with the FIB/FDB split
> across 2 switches, there isn't really any working around the problem
> that the packet might need to be delivered to the CPU somewhere in the
> middle of the data path, and it would need to be injected into the RX
> path of an upper interface in that case.
>
> What do you think?
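A hypothetical sketch of the check Tobias describes above, as it might
look in a switchdev driver's FIB notifier (example_fib_event is an
illustrative name, not from any driver; fib_info_nh()/fib_nh_dev are
the mainline accessors of this era):

/* Hypothetical sketch: refuse to offload routes whose nexthop egresses
 * via a netdev that is foreign to the switch, leaving those routes to
 * the software data path so the hardware never has to take the
 * mid-pipeline exception discussed above. */
static int example_fib_event(struct notifier_block *nb,
			     unsigned long event, void *ptr)
{
	struct fib_entry_notifier_info *fen_info = ptr;
	struct net_device *nh_dev;

	if (event != FIB_EVENT_ENTRY_REPLACE)
		return NOTIFY_DONE;

	nh_dev = fib_info_nh(fen_info->fi, 0)->fib_nh_dev;

	/* Nexthop is not one of our ports: don't offload. */
	if (!dsa_slave_dev_check(nh_dev))
		return NOTIFY_DONE;

	/* ...program the route into the hardware FIB... */
	return NOTIFY_OK;
}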
On Wed, Oct 28, 2020 at 11:31:58PM +0100, Tobias Waldekranz wrote:
> The thing is, unlike L2 where the hardware will add new neighbors to
> its FDB autonomously, every entry in the hardware FIB is under the
> strict control of the CPU. So I think you can avoid much of this
> headache simply by determining if a given L3 nexthop/neighbor is
> "foreign" to the switch or not, and then just skip offloading for
> those entries.
>
> You miss out on the hardware acceleration of replacing the L2 header
> of course. But my guess would be that once you have paid the tax of
> receiving the buffer via the NIC driver, allocated an skb, and called
> netif_rx() etc., the routing operation will be a rounding error. At
> least on smaller devices where the FIB is typically quite small.

Right, but in that case, there is less of an argument to have something
like DSA injecting directly into an upper device's RX path, if only
mv88e6xxx with bonding is ever going to use that.
On Thu Oct 29, 2020 at 2:08 AM CET, Vladimir Oltean wrote:
> On Wed, Oct 28, 2020 at 11:31:58PM +0100, Tobias Waldekranz wrote:
> > The thing is, unlike L2 where the hardware will add new neighbors to
> > its FDB autonomously, every entry in the hardware FIB is under the
> > strict control of the CPU. So I think you can avoid much of this
> > headache simply by determining if a given L3 nexthop/neighbor is
> > "foreign" to the switch or not, and then just skip offloading for
> > those entries.
> >
> > You miss out on the hardware acceleration of replacing the L2 header
> > of course. But my guess would be that once you have paid the tax of
> > receiving the buffer via the NIC driver, allocated an skb, and called
> > netif_rx() etc., the routing operation will be a rounding error. At
> > least on smaller devices where the FIB is typically quite small.
>
> Right, but in that case, there is less of an argument to have something
> like DSA injecting directly into an upper device's RX path, if only
> mv88e6xxx with bonding is ever going to use that.

Doesn't that basically boil down to the argument that "we can't merge
this change because it's never going to be used, except for when it is
used"? I don't know if I buy that.

How about the inverse question: if this change is not acceptable, do
you have any other suggestion on how to solve it? The hardware is what
it is, I cannot will the source port information into existence, and
injecting packets on the wrong DSA port feels even more dirty to me.
On Thu, Oct 29, 2020 at 08:47:17AM +0100, Tobias Waldekranz wrote:
> Doesn't that basically boil down to the argument that "we can't merge
> this change because it's never going to be used, except for when it is
> used"? I don't know if I buy that.
>
> How about the inverse question: if this change is not acceptable, do
> you have any other suggestion on how to solve it? The hardware is what
> it is, I cannot will the source port information into existence, and
> injecting packets on the wrong DSA port feels even more dirty to me.

I suppose you're right, I don't have any better suggestion.
On Wed, Oct 28, 2020 at 08:18:24PM +0200, Vladimir Oltean wrote:
> Yes, I expect that the bridge input would need to have one more entry
> path into it than just br_handle_frame.
>
> I'm a bit confused and undecided right now, so let's look at it from a
> different perspective. Let's imagine a switchdev driver (DSA or not)
> which is able to offload IP forwarding. There are some interfaces that
> are bridged and one that is standalone. The setup looks as below.
>
> IP interfaces
>
> +---------------------------------------------------------+
> |                           br0                           |
> +---------------------------------------------------------+
>
> +------------+ +------------+ +------------+ +------------+ +------------+
> |    swp0    | |    swp1    | |    swp2    | |    swp3    | |    eth0    |
> +------------+ +------------+ +------------+ +------------+ +------------+
>
> Hardware interfaces
>
> +------------+ +------------+ +------------+ +------------+ +------------+
> | DSA port 0 | | DSA port 1 | | DSA port 2 | | DSA port 3 | |   e1000    |
> +------------+ +------------+ +------------+ +------------+ +------------+
>
> Let's say you receive a packet on the standalone swp0, and you need to
> perform IP routing towards the bridged domain br0. Some switchdev/DSA
> ports are bridged and some aren't.
>
> The switchdev/DSA switch will attempt to do the IP routing step first,
> and it _can_ do that because it is aware of the br0 interface, so it
> will decrement the TTL and replace the L2 header.
>
> At this stage we have a modified IP packet, which corresponds with what
> should be injected into the hardware's view of the br0 interface. The
> packet is still in the switchdev/DSA hardware data path.
>
> But then, the switchdev/DSA hardware will look up the FDB in the name of
> br0, in an attempt of finding the destination port for the packet. But
> the packet should be delivered to a station connected to eth0 (e1000,
> foreign interface). So that's part of the exception path, the packet
> should be delivered to the CPU.
>
> But the packet was already modified by the hardware data path (IP
> forwarding has already taken place)! So how should the DSA/switchdev
> hardware deliver the packet to the CPU? It has 2 options:
>
> (a) unwind the entire packet modification, cancel the IP forwarding and
>     deliver the unmodified packet to the CPU on behalf of swp0, the
>     ingress port. Then let software IP forwarding plus software bridging
>     deal with it, so that it can reach the e1000.

This is what happens in the Spectrum ASICs. If a packet hits some
exception in the data path, it is trapped from the Rx port unmodified.

> (b) deliver the packet to the CPU in the middle of the hardware
>     forwarding data path, where the exception/miss occurred, aka deliver
>     it on behalf of br0. Modified by IP forwarding. This is where we'd
>     have to manually inject skb->dev into br0 somehow.
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 2131bf2b3a67..b84e5f0be049 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -220,7 +220,6 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
 	}
 
 	skb = nskb;
-	p = netdev_priv(skb->dev);
 	skb_push(skb, ETH_HLEN);
 	skb->pkt_type = PACKET_HOST;
 	skb->protocol = eth_type_trans(skb, skb->dev);
@@ -234,17 +233,21 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
 		skb = nskb;
 	}
 
-	s = this_cpu_ptr(p->stats64);
-	u64_stats_update_begin(&s->syncp);
-	s->rx_packets++;
-	s->rx_bytes += skb->len;
-	u64_stats_update_end(&s->syncp);
+	if (dsa_slave_dev_check(skb->dev)) {
+		p = netdev_priv(skb->dev);
+		s = this_cpu_ptr(p->stats64);
+		u64_stats_update_begin(&s->syncp);
+		s->rx_packets++;
+		s->rx_bytes += skb->len;
+		u64_stats_update_end(&s->syncp);
 
-	if (dsa_skb_defer_rx_timestamp(p, skb))
-		return 0;
-
-	gro_cells_receive(&p->gcells, skb);
+		if (dsa_skb_defer_rx_timestamp(p, skb))
+			return 0;
 
+		gro_cells_receive(&p->gcells, skb);
+	} else {
+		netif_rx(skb);
+	}
 	return 0;
 }
 
diff --git a/net/dsa/tag_edsa.c b/net/dsa/tag_edsa.c
index 120614240319..800b02f04394 100644
--- a/net/dsa/tag_edsa.c
+++ b/net/dsa/tag_edsa.c
@@ -86,6 +86,7 @@ static struct sk_buff *edsa_xmit(struct sk_buff *skb, struct net_device *dev)
 static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev,
 				struct packet_type *pt)
 {
+	bool trunk = false;
 	u8 *edsa_header;
 	int frame_type;
 	int code;
@@ -120,6 +121,7 @@ static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev,
 		break;
 
 	case FRAME_TYPE_FORWARD:
+		trunk = !!(edsa_header[1] & 7);
 		skb->offload_fwd_mark = 1;
 		break;
 
@@ -133,7 +135,15 @@ static struct sk_buff *edsa_rcv(struct sk_buff *skb, struct net_device *dev,
 	source_device = edsa_header[0] & 0x1f;
 	source_port = (edsa_header[1] >> 3) & 0x1f;
 
-	skb->dev = dsa_master_find_slave(dev, source_device, source_port);
+	if (trunk) {
+		struct dsa_port *cpu_dp = dev->dsa_ptr;
+
+		skb->dev = dsa_lag_dev_by_id(cpu_dp->dst, source_port);
+	} else {
+		skb->dev = dsa_master_find_slave(dev, source_device,
+						 source_port);
+	}
+
 	if (!skb->dev)
 		return NULL;
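The dsa_lag_dev_by_id() helper used above comes from an earlier patch in
this series and is not shown here. A minimal sketch of the shape such a
lookup could take, assuming the switch tree keeps a LAG-id-indexed table
of bond/team netdevs (the lags array and DSA_MAX_LAGS below are
illustrative assumptions, not the series' actual implementation):

/* Illustrative sketch only: map the LAG id carried in the tag's port
 * field back to the bond/team netdev that the user placed the switch
 * ports under. Field names are assumptions. */
static inline struct net_device *
dsa_lag_dev_by_id(struct dsa_switch_tree *dst, unsigned int id)
{
	return id < DSA_MAX_LAGS ? dst->lags[id] : NULL;
}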
Packets ingressing on a LAG that egress on the CPU port, which are not
classified as management, will have a FORWARD tag that does not
contain the normal source device/port tuple. Instead the trunk bit
will be set, and the port field holds the LAG id.

Since the exact source port information is not available in the tag,
frames are injected directly on the LAG interface and thus never
pass through any DSA port interface on ingress.

Management frames (TO_CPU) are not affected and will pass through the
DSA port interface as usual.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 net/dsa/dsa.c      | 23 +++++++++++++----------
 net/dsa/tag_edsa.c | 12 +++++++++++-
 2 files changed, 24 insertions(+), 11 deletions(-)
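As a self-contained illustration of the field extraction this patch
performs on the EDSA source bytes (the masks and shifts are taken
directly from the diff above; the example header bytes and the program
scaffolding are mine):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Example FORWARD tag source bytes: low bits of byte 1 non-zero
	 * (the trunk indication checked by edsa_rcv()), port field 1. */
	uint8_t edsa_header[2] = { 0x00, 0x0c };
	int source_device = edsa_header[0] & 0x1f;
	int source_port = (edsa_header[1] >> 3) & 0x1f;
	int trunk = !!(edsa_header[1] & 7);

	if (trunk)
		printf("FORWARD from LAG %d\n", source_port);
	else
		printf("FORWARD from device %d, port %d\n",
		       source_device, source_port);
	return 0;
}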