[RFC,v2,0/3] l3mdev icmp error route lookup fixes

Message ID	20200918181801.2571-1-mathieu.desnoyers@efficios.com
Headers	show Return-Path: <SRS0=Abal=C3=vger.kernel.org=netdev-owner@kernel.org> From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> To: David Ahern <dsahern@kernel.org>, "David S . Miller" <davem@davemloft.net>, netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Subject: [RFC PATCH v2 0/3] l3mdev icmp error route lookup fixes Date: Fri, 18 Sep 2020 14:17:58 -0400 Message-Id: <20200918181801.2571-1-mathieu.desnoyers@efficios.com> Precedence: bulk
Series	l3mdev icmp error route lookup fixes \| expand [RFC,v2,0/3] l3mdev icmp error route lookup fixes [RFC,v2,2/3] ipv6/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2)

Mathieu Desnoyers Sept. 18, 2020, 6:17 p.m. UTC

Hi,

Here is an updated series of fixes for ipv4 and ipv6 which which ensure
the route lookup is performed on the right routing table in VRF
configurations when sending TTL expired icmp errors (useful for
traceroute).

It includes tests for both ipv4 and ipv6.

These fixes address specifically address the code paths involved in
sending TTL expired icmp errors. As detailed in the individual commit
messages, those fixes do not address similar issues related to network
namespaces and unreachable / fragmentation needed messages, which appear
to use different code paths.

Thanks,

Mathieu


Mathieu Desnoyers (2):
  ipv4/icmp: l3mdev: Perform icmp error route lookup on source device
    routing table (v2)
  ipv6/icmp: l3mdev: Perform icmp error route lookup on source device
    routing table (v2)

Michael Jeanson (1):
  selftests: Add VRF icmp error route lookup test

 net/ipv4/icmp.c                               |  23 +-
 net/ipv6/icmp.c                               |   7 +-
 net/ipv6/ip6_output.c                         |   2 -
 tools/testing/selftests/net/Makefile          |   1 +
 .../selftests/net/vrf_icmp_error_route.sh     | 475 ++++++++++++++++++
 5 files changed, 502 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/net/vrf_icmp_error_route.sh

David Ahern Sept. 21, 2020, 6:36 p.m. UTC | #1

On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:
> Hi,

> 

> Here is an updated series of fixes for ipv4 and ipv6 which which ensure

> the route lookup is performed on the right routing table in VRF

> configurations when sending TTL expired icmp errors (useful for

> traceroute).

> 

> It includes tests for both ipv4 and ipv6.

> 

> These fixes address specifically address the code paths involved in

> sending TTL expired icmp errors. As detailed in the individual commit

> messages, those fixes do not address similar issues related to network

> namespaces and unreachable / fragmentation needed messages, which appear

> to use different code paths.

> 


New selftests are failing:
TEST: Ping received ICMP frag needed                       [FAIL]

Both IPv4 and IPv6 versions are failing.

Mathieu Desnoyers Sept. 21, 2020, 6:44 p.m. UTC | #2

----- On Sep 21, 2020, at 2:36 PM, David Ahern dsahern@gmail.com wrote:

> On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:
>> Hi,
>> 
>> Here is an updated series of fixes for ipv4 and ipv6 which which ensure
>> the route lookup is performed on the right routing table in VRF
>> configurations when sending TTL expired icmp errors (useful for
>> traceroute).
>> 
>> It includes tests for both ipv4 and ipv6.
>> 
>> These fixes address specifically address the code paths involved in
>> sending TTL expired icmp errors. As detailed in the individual commit
>> messages, those fixes do not address similar issues related to network
>> namespaces and unreachable / fragmentation needed messages, which appear
>> to use different code paths.
>> 
> 
> New selftests are failing:
> TEST: Ping received ICMP frag needed                       [FAIL]
> 
> Both IPv4 and IPv6 versions are failing.

Indeed, this situation is discussed in each patch commit message:

ipv4:

[ It has also been pointed out that a similar issue exists with
  unreachable / fragmentation needed messages, which can be triggered by
  changing the MTU of eth1 in r1 to 1400 and running:

  ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2

  Some investigation points to raw_icmp_error() and raw_err() as being
  involved in this last scenario. The focus of this patch is TTL expired
  ICMP messages, which go through icmp_route_lookup.
  Investigation of failure modes related to raw_icmp_error() is beyond
  this investigation's scope. ]

ipv6:

[ Testing shows that similar issues exist with ipv6 unreachable /
  fragmentation needed messages.  However, investigation of this
  additional failure mode is beyond this investigation's scope. ]

I do not have the time to investigate further unfortunately, so I
thought it best to post what I have.

Note that network namespaces also probably have the same problem,
but those are not covered by the test cases.

Thanks,

Mathieu

David Ahern Sept. 21, 2020, 7:11 p.m. UTC | #3

On 9/21/20 12:44 PM, Mathieu Desnoyers wrote:
> ----- On Sep 21, 2020, at 2:36 PM, David Ahern dsahern@gmail.com wrote:

> 

>> On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:

>>> Hi,

>>>

>>> Here is an updated series of fixes for ipv4 and ipv6 which which ensure

>>> the route lookup is performed on the right routing table in VRF

>>> configurations when sending TTL expired icmp errors (useful for

>>> traceroute).

>>>

>>> It includes tests for both ipv4 and ipv6.

>>>

>>> These fixes address specifically address the code paths involved in

>>> sending TTL expired icmp errors. As detailed in the individual commit

>>> messages, those fixes do not address similar issues related to network

>>> namespaces and unreachable / fragmentation needed messages, which appear

>>> to use different code paths.

>>>

>>

>> New selftests are failing:

>> TEST: Ping received ICMP frag needed                       [FAIL]

>>

>> Both IPv4 and IPv6 versions are failing.

> 

> Indeed, this situation is discussed in each patch commit message:

> 

> ipv4:

> 

> [ It has also been pointed out that a similar issue exists with

>   unreachable / fragmentation needed messages, which can be triggered by

>   changing the MTU of eth1 in r1 to 1400 and running:

> 

>   ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2

> 

>   Some investigation points to raw_icmp_error() and raw_err() as being

>   involved in this last scenario. The focus of this patch is TTL expired

>   ICMP messages, which go through icmp_route_lookup.

>   Investigation of failure modes related to raw_icmp_error() is beyond

>   this investigation's scope. ]

> 

> ipv6:

> 

> [ Testing shows that similar issues exist with ipv6 unreachable /

>   fragmentation needed messages.  However, investigation of this

>   additional failure mode is beyond this investigation's scope. ]

> 

> I do not have the time to investigate further unfortunately, so I

> thought it best to post what I have.

> 


the test setup is bad. You have r1 dropping the MTU in VRF red, but not
telling VRF red how to send back the ICMP. e.g., for IPv4 add:

    ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue

do the same for v6.

Also, I do not see a reason for r2; I suggest dropping it. What you are
testing is icmp crossing VRF with route leaking, so there should not be
a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and
the return via r2).

Mathieu Desnoyers Sept. 21, 2020, 7:33 p.m. UTC | #4

----- On Sep 21, 2020, at 3:11 PM, David Ahern dsahern@gmail.com wrote:

> On 9/21/20 12:44 PM, Mathieu Desnoyers wrote:

>> ----- On Sep 21, 2020, at 2:36 PM, David Ahern dsahern@gmail.com wrote:

>> 

>>> On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:

>>>> Hi,

>>>>

>>>> Here is an updated series of fixes for ipv4 and ipv6 which which ensure

>>>> the route lookup is performed on the right routing table in VRF

>>>> configurations when sending TTL expired icmp errors (useful for

>>>> traceroute).

>>>>

>>>> It includes tests for both ipv4 and ipv6.

>>>>

>>>> These fixes address specifically address the code paths involved in

>>>> sending TTL expired icmp errors. As detailed in the individual commit

>>>> messages, those fixes do not address similar issues related to network

>>>> namespaces and unreachable / fragmentation needed messages, which appear

>>>> to use different code paths.

>>>>

>>>

>>> New selftests are failing:

>>> TEST: Ping received ICMP frag needed                       [FAIL]

>>>

>>> Both IPv4 and IPv6 versions are failing.

>> 

>> Indeed, this situation is discussed in each patch commit message:

>> 

>> ipv4:

>> 

>> [ It has also been pointed out that a similar issue exists with

>>   unreachable / fragmentation needed messages, which can be triggered by

>>   changing the MTU of eth1 in r1 to 1400 and running:

>> 

>>   ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2

>> 

>>   Some investigation points to raw_icmp_error() and raw_err() as being

>>   involved in this last scenario. The focus of this patch is TTL expired

>>   ICMP messages, which go through icmp_route_lookup.

>>   Investigation of failure modes related to raw_icmp_error() is beyond

>>   this investigation's scope. ]

>> 

>> ipv6:

>> 

>> [ Testing shows that similar issues exist with ipv6 unreachable /

>>   fragmentation needed messages.  However, investigation of this

>>   additional failure mode is beyond this investigation's scope. ]

>> 

>> I do not have the time to investigate further unfortunately, so I

>> thought it best to post what I have.

>> 

> 

> the test setup is bad. You have r1 dropping the MTU in VRF red, but not

> telling VRF red how to send back the ICMP. e.g., for IPv4 add:

> 

>    ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue

> 

> do the same for v6.

> 

> Also, I do not see a reason for r2; I suggest dropping it. What you are

> testing is icmp crossing VRF with route leaking, so there should not be

> a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and

> the return via r2).


CCing Michael Jeanson, author of the selftests patch.

Thanks for your feedback,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Michael Jeanson Sept. 22, 2020, 1:52 p.m. UTC | #5

----- On 21 Sep, 2020, at 15:33, Mathieu Desnoyers mathieu.desnoyers@efficios.com wrote:

> ----- On Sep 21, 2020, at 3:11 PM, David Ahern dsahern@gmail.com wrote:
> 
>> On 9/21/20 12:44 PM, Mathieu Desnoyers wrote:
>>> ----- On Sep 21, 2020, at 2:36 PM, David Ahern dsahern@gmail.com wrote:
>>> 
>>>> On 9/18/20 12:17 PM, Mathieu Desnoyers wrote:
>>>>> Hi,
>>>>>
>>>>> Here is an updated series of fixes for ipv4 and ipv6 which which ensure
>>>>> the route lookup is performed on the right routing table in VRF
>>>>> configurations when sending TTL expired icmp errors (useful for
>>>>> traceroute).
>>>>>
>>>>> It includes tests for both ipv4 and ipv6.
>>>>>
>>>>> These fixes address specifically address the code paths involved in
>>>>> sending TTL expired icmp errors. As detailed in the individual commit
>>>>> messages, those fixes do not address similar issues related to network
>>>>> namespaces and unreachable / fragmentation needed messages, which appear
>>>>> to use different code paths.
>>>>>
>>>>
>>>> New selftests are failing:
>>>> TEST: Ping received ICMP frag needed                       [FAIL]
>>>>
>>>> Both IPv4 and IPv6 versions are failing.
>>> 
>>> Indeed, this situation is discussed in each patch commit message:
>>> 
>>> ipv4:
>>> 
>>> [ It has also been pointed out that a similar issue exists with
>>>   unreachable / fragmentation needed messages, which can be triggered by
>>>   changing the MTU of eth1 in r1 to 1400 and running:
>>> 
>>>   ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2
>>> 
>>>   Some investigation points to raw_icmp_error() and raw_err() as being
>>>   involved in this last scenario. The focus of this patch is TTL expired
>>>   ICMP messages, which go through icmp_route_lookup.
>>>   Investigation of failure modes related to raw_icmp_error() is beyond
>>>   this investigation's scope. ]
>>> 
>>> ipv6:
>>> 
>>> [ Testing shows that similar issues exist with ipv6 unreachable /
>>>   fragmentation needed messages.  However, investigation of this
>>>   additional failure mode is beyond this investigation's scope. ]
>>> 
>>> I do not have the time to investigate further unfortunately, so I
>>> thought it best to post what I have.
>>> 
>> 
>> the test setup is bad. You have r1 dropping the MTU in VRF red, but not
>> telling VRF red how to send back the ICMP. e.g., for IPv4 add:
>> 
>>    ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue
>> 
>> do the same for v6.
>> 
>> Also, I do not see a reason for r2; I suggest dropping it. What you are
>> testing is icmp crossing VRF with route leaking, so there should not be
>> a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and
>> the return via r2).

The objective of the test was to replicate a clients environment where
packets are crossing from a VRF which has a route back to the source to
one which doesn't while reaching a ttl of 0. If the route lookup for the
icmp error is done on the interface in the first VRF, it can be routed to
the source but not on the interface in the second VRF which is the
current behaviour for icmp errors generated while crossing between VRFs.

There may be a better test case that doesn't involve asymmetric routing
to test this but it's the only way I found to replicate this.

David Ahern Sept. 23, 2020, 1:59 a.m. UTC | #6

On 9/22/20 7:52 AM, Michael Jeanson wrote:
>>>
>>> the test setup is bad. You have r1 dropping the MTU in VRF red, but not
>>> telling VRF red how to send back the ICMP. e.g., for IPv4 add:
>>>
>>>    ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue
>>>
>>> do the same for v6.
>>>
>>> Also, I do not see a reason for r2; I suggest dropping it. What you are
>>> testing is icmp crossing VRF with route leaking, so there should not be
>>> a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and
>>> the return via r2).
> 
> The objective of the test was to replicate a clients environment where
> packets are crossing from a VRF which has a route back to the source to
> one which doesn't while reaching a ttl of 0. If the route lookup for the
> icmp error is done on the interface in the first VRF, it can be routed to
> the source but not on the interface in the second VRF which is the
> current behaviour for icmp errors generated while crossing between VRFs.
> 
> There may be a better test case that doesn't involve asymmetric routing
> to test this but it's the only way I found to replicate this.
> 

It should work without asymmetric routing; adding the return route to
the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It
should work for TTL as well.

Adding a second pass on the tests with the return through r2 is fine,
but add a first pass for the more typical case.

Michael Jeanson Sept. 23, 2020, 4:04 p.m. UTC | #7

On 2020-09-22 21 h 59, David Ahern wrote:
> On 9/22/20 7:52 AM, Michael Jeanson wrote:

>>>>

>>>> the test setup is bad. You have r1 dropping the MTU in VRF red, but not

>>>> telling VRF red how to send back the ICMP. e.g., for IPv4 add:

>>>>

>>>>     ip -netns r1 ro add vrf red 172.16.1.0/24 dev blue

>>>>

>>>> do the same for v6.

>>>>

>>>> Also, I do not see a reason for r2; I suggest dropping it. What you are

>>>> testing is icmp crossing VRF with route leaking, so there should not be

>>>> a need for r2 which leads to asymmetrical routing (172.16.1.0 via r1 and

>>>> the return via r2).

>>

>> The objective of the test was to replicate a clients environment where

>> packets are crossing from a VRF which has a route back to the source to

>> one which doesn't while reaching a ttl of 0. If the route lookup for the

>> icmp error is done on the interface in the first VRF, it can be routed to

>> the source but not on the interface in the second VRF which is the

>> current behaviour for icmp errors generated while crossing between VRFs.

>>

>> There may be a better test case that doesn't involve asymmetric routing

>> to test this but it's the only way I found to replicate this.

>>

> 

> It should work without asymmetric routing; adding the return route to

> the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It

> should work for TTL as well.

> 

> Adding a second pass on the tests with the return through r2 is fine,

> but add a first pass for the more typical case.

Hi,

Before writing new tests I just want to make sure we are trying to fix 
the same issue. If I add a return route to the red VRF then we don't
need this patchset because whether the ICMP error are routed using the
table from the source or destination interface they will reach the 
source host.

The issue for which this patchset was sent only happens when the 
destination interface's VRF doesn't have a route back to the source 
host. I guess we might question if this is actually a bug or not.

So the question really is, when a packet is forwarded between VRFs 
through route leaking and an icmp error is generated, which table should 
be used for the route lookup? And does it depend on the type of icmp 
error? (e.g. TTL=1 happens before forwarding, but fragmentation needed 
happens after when on the destination interface)

Cheers,

Michael

Michael Jeanson Sept. 23, 2020, 5:03 p.m. UTC | #8

On 2020-09-23 12 h 04, Michael Jeanson wrote:
>> It should work without asymmetric routing; adding the return route to
>> the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It
>> should work for TTL as well.
>>
>> Adding a second pass on the tests with the return through r2 is fine,
>> but add a first pass for the more typical case.
> 
> Hi,
> 
> Before writing new tests I just want to make sure we are trying to fix 
> the same issue. If I add a return route to the red VRF then we don't
> need this patchset because whether the ICMP error are routed using the
> table from the source or destination interface they will reach the 
> source host.
> 
> The issue for which this patchset was sent only happens when the 
> destination interface's VRF doesn't have a route back to the source 
> host. I guess we might question if this is actually a bug or not.
> 
> So the question really is, when a packet is forwarded between VRFs 
> through route leaking and an icmp error is generated, which table should 
> be used for the route lookup? And does it depend on the type of icmp 
> error? (e.g. TTL=1 happens before forwarding, but fragmentation needed 
> happens after when on the destination interface)

As a side note, I don't mind reworking the tests as you requested even 
if the patchset as a whole ends up not being needed and if you think 
they are still useful. I just wanted to make sure we understood each other.

Cheers,

Michael

David Ahern Sept. 23, 2020, 6:46 p.m. UTC | #9

On 9/23/20 11:03 AM, Michael Jeanson wrote:
> On 2020-09-23 12 h 04, Michael Jeanson wrote:
>>> It should work without asymmetric routing; adding the return route to
>>> the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It
>>> should work for TTL as well.
>>>
>>> Adding a second pass on the tests with the return through r2 is fine,
>>> but add a first pass for the more typical case.
>>
>> Hi,
>>
>> Before writing new tests I just want to make sure we are trying to fix
>> the same issue. If I add a return route to the red VRF then we don't
>> need this patchset because whether the ICMP error are routed using the
>> table from the source or destination interface they will reach the
>> source host.
>>
>> The issue for which this patchset was sent only happens when the
>> destination interface's VRF doesn't have a route back to the source
>> host. I guess we might question if this is actually a bug or not.
>>
>> So the question really is, when a packet is forwarded between VRFs
>> through route leaking and an icmp error is generated, which table
>> should be used for the route lookup? And does it depend on the type of
>> icmp error? (e.g. TTL=1 happens before forwarding, but fragmentation
>> needed happens after when on the destination interface)
> 
> As a side note, I don't mind reworking the tests as you requested even
> if the patchset as a whole ends up not being needed and if you think
> they are still useful. I just wanted to make sure we understood each other.
> 

if you are leaking from VRF 1 to VRF 2 and you do not configure VRF 2
with how to send to errors back to source - MTU or TTL - then I will
argue that is a configuration problem, not a bug.

Now the TTL problem is interesting. You need the FIB lookup to know that
the packet is forwarded, and by the time of the ttl check in ip_forward
skb->dev points to the ingress VRF and dst points to the egress VRF. So
I think the change is warranted.

Let's do this for the tests:
1 pass through all of the tests (TTL and MTU, v4 and v6) with symmetric
routing configured and make sure they all pass. ie., keep all of them
and make sure all tests pass. No sense losing the tests and the thoughts
behind them.

Add a second pass with the asymmetric routing per the customer setup
since it motivated the investigation.

Rename the test to something like vrf_route_leaking.sh. It can be
expanded with more tests related to route leaking as they come up.

Michael Jeanson Sept. 23, 2020, 7:12 p.m. UTC | #10

On 2020-09-23 14 h 46, David Ahern wrote:
> On 9/23/20 11:03 AM, Michael Jeanson wrote:

>> On 2020-09-23 12 h 04, Michael Jeanson wrote:

>>>> It should work without asymmetric routing; adding the return route to

>>>> the second vrf as I mentioned above fixes the FRAG_NEEDED problem. It

>>>> should work for TTL as well.

>>>>

>>>> Adding a second pass on the tests with the return through r2 is fine,

>>>> but add a first pass for the more typical case.

>>>

>>> Hi,

>>>

>>> Before writing new tests I just want to make sure we are trying to fix

>>> the same issue. If I add a return route to the red VRF then we don't

>>> need this patchset because whether the ICMP error are routed using the

>>> table from the source or destination interface they will reach the

>>> source host.

>>>

>>> The issue for which this patchset was sent only happens when the

>>> destination interface's VRF doesn't have a route back to the source

>>> host. I guess we might question if this is actually a bug or not.

>>>

>>> So the question really is, when a packet is forwarded between VRFs

>>> through route leaking and an icmp error is generated, which table

>>> should be used for the route lookup? And does it depend on the type of

>>> icmp error? (e.g. TTL=1 happens before forwarding, but fragmentation

>>> needed happens after when on the destination interface)

>>

>> As a side note, I don't mind reworking the tests as you requested even

>> if the patchset as a whole ends up not being needed and if you think

>> they are still useful. I just wanted to make sure we understood each other.

>>

> 

> if you are leaking from VRF 1 to VRF 2 and you do not configure VRF 2

> with how to send to errors back to source - MTU or TTL - then I will

> argue that is a configuration problem, not a bug.

> 

> Now the TTL problem is interesting. You need the FIB lookup to know that

> the packet is forwarded, and by the time of the ttl check in ip_forward

> skb->dev points to the ingress VRF and dst points to the egress VRF. So

> I think the change is warranted.

> 

> Let's do this for the tests:

> 1 pass through all of the tests (TTL and MTU, v4 and v6) with symmetric

> routing configured and make sure they all pass. ie., keep all of them

> and make sure all tests pass. No sense losing the tests and the thoughts

> behind them.

> 

> Add a second pass with the asymmetric routing per the customer setup

> since it motivated the investigation.

> 

> Rename the test to something like vrf_route_leaking.sh. It can be

> expanded with more tests related to route leaking as they come up.

> 


Just a final clarification, the asymmetric setup would have no return 
route in VRF 2 and only test the TTL case since the others would fail?

David Ahern Sept. 23, 2020, 8:04 p.m. UTC | #11

On 9/23/20 1:12 PM, Michael Jeanson wrote:
> 
> Just a final clarification, the asymmetric setup would have no return
> route in VRF 2 and only test the TTL case since the others would fail?

correct. add a statement about it representing a customer setup so it is
clear such a config is a 1-off

[RFC,v2,0/3] l3mdev icmp error route lookup fixes

Message

Comments