diff mbox series

[bpf] udp: fix a memory leak in udp_read_sock()

Message ID 20210517022322.50501-1-xiyou.wangcong@gmail.com
State Superseded
Headers show
Series [bpf] udp: fix a memory leak in udp_read_sock() | expand

Commit Message

Cong Wang May 17, 2021, 2:23 a.m. UTC
From: Cong Wang <cong.wang@bytedance.com>

sk_psock_verdict_recv() clones the skb and uses the clone
afterward, so udp_read_sock() should free the original skb after
done using it.

Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap")
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
---
 net/ipv4/udp.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

John Fastabend May 18, 2021, 5:36 a.m. UTC | #1
Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>

> 

> sk_psock_verdict_recv() clones the skb and uses the clone

> afterward, so udp_read_sock() should free the original skb after

> done using it.


The clone only happens if sk_psock_verdict_recv() returns >0.

> 

> Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap")

> Cc: John Fastabend <john.fastabend@gmail.com>

> Cc: Daniel Borkmann <daniel@iogearbox.net>

> Cc: Jakub Sitnicki <jakub@cloudflare.com>

> Cc: Lorenz Bauer <lmb@cloudflare.com>

> Signed-off-by: Cong Wang <cong.wang@bytedance.com>

> ---

>  net/ipv4/udp.c | 2 ++

>  1 file changed, 2 insertions(+)

> 

> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c

> index 15f5504adf5b..e31d67fd5183 100644

> --- a/net/ipv4/udp.c

> +++ b/net/ipv4/udp.c

> @@ -1798,11 +1798,13 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,

>  		if (used <= 0) {

>  			if (!copied)

>  				copied = used;

> +			kfree_skb(skb);


This case is different from the TCP side, if there is an error
the sockmap side will also call kfree_skb(). In TCP side we peek
the skb because we don't want to drop it. On UDP side this will
just drop data on the floor. Its not super friendly, but its
UDP so we are making the assumption this is ok? We've tried
to remove all the drop data cases from TCP it would be nice
to not drop data on UDP side if we can help it. Could we
requeue or peek the UDP skb to avoid this?

>  			break;

>  		} else if (used <= skb->len) {

>  			copied += used;

>  		}

>  

> +		kfree_skb(skb);

>  		if (!desc->count)

>  			break;

>  	}

> -- 

> 2.25.1

>
Cong Wang May 18, 2021, 4:54 p.m. UTC | #2
On Mon, May 17, 2021 at 10:36 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>

> Cong Wang wrote:

> > From: Cong Wang <cong.wang@bytedance.com>

> >

> > sk_psock_verdict_recv() clones the skb and uses the clone

> > afterward, so udp_read_sock() should free the original skb after

> > done using it.

>

> The clone only happens if sk_psock_verdict_recv() returns >0.


Sure, in case of error, no one uses the original skb either,
so still need to free it.

>

> >

> > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap")

> > Cc: John Fastabend <john.fastabend@gmail.com>

> > Cc: Daniel Borkmann <daniel@iogearbox.net>

> > Cc: Jakub Sitnicki <jakub@cloudflare.com>

> > Cc: Lorenz Bauer <lmb@cloudflare.com>

> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>

> > ---

> >  net/ipv4/udp.c | 2 ++

> >  1 file changed, 2 insertions(+)

> >

> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c

> > index 15f5504adf5b..e31d67fd5183 100644

> > --- a/net/ipv4/udp.c

> > +++ b/net/ipv4/udp.c

> > @@ -1798,11 +1798,13 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,

> >               if (used <= 0) {

> >                       if (!copied)

> >                               copied = used;

> > +                     kfree_skb(skb);

>

> This case is different from the TCP side, if there is an error

> the sockmap side will also call kfree_skb(). In TCP side we peek

> the skb because we don't want to drop it. On UDP side this will

> just drop data on the floor. Its not super friendly, but its

> UDP so we are making the assumption this is ok? We've tried

> to remove all the drop data cases from TCP it would be nice

> to not drop data on UDP side if we can help it. Could we

> requeue or peek the UDP skb to avoid this?


TCP is special because it supports splice() where we can
do a partial read, so it needs to peek the skb, right? UDP only
supports sockmap, where we always read a whole skb, so we
do not need to peek here?

Thanks.
John Fastabend May 18, 2021, 7:56 p.m. UTC | #3
Cong Wang wrote:
> On Mon, May 17, 2021 at 10:36 PM John Fastabend

> <john.fastabend@gmail.com> wrote:

> >

> > Cong Wang wrote:

> > > From: Cong Wang <cong.wang@bytedance.com>

> > >

> > > sk_psock_verdict_recv() clones the skb and uses the clone

> > > afterward, so udp_read_sock() should free the original skb after

> > > done using it.

> >

> > The clone only happens if sk_psock_verdict_recv() returns >0.

> 

> Sure, in case of error, no one uses the original skb either,

> so still need to free it.


But the data is going to be dropped then. I'm questioning if this
is the best we can do or not. Its simplest sure, but could we
do a bit more work and peek those skbs or requeue them? Otherwise
if you cross memory limits for a bit your likely to drop these
unnecessarily.

> 

> >

> > >

> > > Fixes: d7f571188ecf ("udp: Implement ->read_sock() for sockmap")

> > > Cc: John Fastabend <john.fastabend@gmail.com>

> > > Cc: Daniel Borkmann <daniel@iogearbox.net>

> > > Cc: Jakub Sitnicki <jakub@cloudflare.com>

> > > Cc: Lorenz Bauer <lmb@cloudflare.com>

> > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>

> > > ---

> > >  net/ipv4/udp.c | 2 ++

> > >  1 file changed, 2 insertions(+)

> > >

> > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c

> > > index 15f5504adf5b..e31d67fd5183 100644

> > > --- a/net/ipv4/udp.c

> > > +++ b/net/ipv4/udp.c

> > > @@ -1798,11 +1798,13 @@ int udp_read_sock(struct sock *sk, read_descriptor_t *desc,

> > >               if (used <= 0) {

> > >                       if (!copied)

> > >                               copied = used;

> > > +                     kfree_skb(skb);

> >

> > This case is different from the TCP side, if there is an error

> > the sockmap side will also call kfree_skb(). In TCP side we peek

> > the skb because we don't want to drop it. On UDP side this will

> > just drop data on the floor. Its not super friendly, but its

> > UDP so we are making the assumption this is ok? We've tried

> > to remove all the drop data cases from TCP it would be nice

> > to not drop data on UDP side if we can help it. Could we

> > requeue or peek the UDP skb to avoid this?

> 

> TCP is special because it supports splice() where we can

> do a partial read, so it needs to peek the skb, right? UDP only

> supports sockmap, where we always read a whole skb, so we

> do not need to peek here?


Its also about not dropping data. In TCP we should not drop
data at this point in the stack so if we get an error, ENOMEM
or otherwise, we need to ensure we keep the original skb.

> 

> Thanks.
Cong Wang May 18, 2021, 9:21 p.m. UTC | #4
On Tue, May 18, 2021 at 12:56 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>

> Cong Wang wrote:

> > On Mon, May 17, 2021 at 10:36 PM John Fastabend

> > <john.fastabend@gmail.com> wrote:

> > >

> > > Cong Wang wrote:

> > > > From: Cong Wang <cong.wang@bytedance.com>

> > > >

> > > > sk_psock_verdict_recv() clones the skb and uses the clone

> > > > afterward, so udp_read_sock() should free the original skb after

> > > > done using it.

> > >

> > > The clone only happens if sk_psock_verdict_recv() returns >0.

> >

> > Sure, in case of error, no one uses the original skb either,

> > so still need to free it.

>

> But the data is going to be dropped then. I'm questioning if this

> is the best we can do or not. Its simplest sure, but could we

> do a bit more work and peek those skbs or requeue them? Otherwise

> if you cross memory limits for a bit your likely to drop these

> unnecessarily.


What are the benefits of not dropping it? When sockmap takes
over sk->sk_data_ready() it should have total control over the skb's
in the receive queue. Otherwise user-space recvmsg() would race
with sockmap when they try to read the first skb at the same time,
therefore potentially user-space could get duplicated data (one via
recvmsg(), one via sockmap). I don't see any benefits but races here.

Thanks.
John Fastabend May 19, 2021, 7:06 p.m. UTC | #5
Cong Wang wrote:
> On Tue, May 18, 2021 at 12:56 PM John Fastabend

> <john.fastabend@gmail.com> wrote:

> >

> > Cong Wang wrote:

> > > On Mon, May 17, 2021 at 10:36 PM John Fastabend

> > > <john.fastabend@gmail.com> wrote:

> > > >

> > > > Cong Wang wrote:

> > > > > From: Cong Wang <cong.wang@bytedance.com>

> > > > >

> > > > > sk_psock_verdict_recv() clones the skb and uses the clone

> > > > > afterward, so udp_read_sock() should free the original skb after

> > > > > done using it.

> > > >

> > > > The clone only happens if sk_psock_verdict_recv() returns >0.

> > >

> > > Sure, in case of error, no one uses the original skb either,

> > > so still need to free it.

> >

> > But the data is going to be dropped then. I'm questioning if this

> > is the best we can do or not. Its simplest sure, but could we

> > do a bit more work and peek those skbs or requeue them? Otherwise

> > if you cross memory limits for a bit your likely to drop these

> > unnecessarily.

> 

> What are the benefits of not dropping it? When sockmap takes

> over sk->sk_data_ready() it should have total control over the skb's

> in the receive queue. Otherwise user-space recvmsg() would race

> with sockmap when they try to read the first skb at the same time,

> therefore potentially user-space could get duplicated data (one via

> recvmsg(), one via sockmap). I don't see any benefits but races here.


The benefit of _not_ dropping it is the packet gets to the receiver
side. We've spent a bit of effort to get a packet across the network,
received on the stack, and then we drop it at the last point is not
so friendly.

About races is the socket is locked by the caller here? Or is this
not the case for UDP.

Its OK in the end to say "its UDP and lossy" but ideally we don't
make things worse by adding sockmap into the stack. We had these
problems already on TCP side, where they are much more severe
because sender believes retransmits will happen, and fixed them
by now. It would be nice if UDP side also didn't introduce
drops.

> 

> Thanks.
Cong Wang May 19, 2021, 8:17 p.m. UTC | #6
On Wed, May 19, 2021 at 12:06 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>

> Cong Wang wrote:

> > On Tue, May 18, 2021 at 12:56 PM John Fastabend

> > <john.fastabend@gmail.com> wrote:

> > >

> > > Cong Wang wrote:

> > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend

> > > > <john.fastabend@gmail.com> wrote:

> > > > >

> > > > > Cong Wang wrote:

> > > > > > From: Cong Wang <cong.wang@bytedance.com>

> > > > > >

> > > > > > sk_psock_verdict_recv() clones the skb and uses the clone

> > > > > > afterward, so udp_read_sock() should free the original skb after

> > > > > > done using it.

> > > > >

> > > > > The clone only happens if sk_psock_verdict_recv() returns >0.

> > > >

> > > > Sure, in case of error, no one uses the original skb either,

> > > > so still need to free it.

> > >

> > > But the data is going to be dropped then. I'm questioning if this

> > > is the best we can do or not. Its simplest sure, but could we

> > > do a bit more work and peek those skbs or requeue them? Otherwise

> > > if you cross memory limits for a bit your likely to drop these

> > > unnecessarily.

> >

> > What are the benefits of not dropping it? When sockmap takes

> > over sk->sk_data_ready() it should have total control over the skb's

> > in the receive queue. Otherwise user-space recvmsg() would race

> > with sockmap when they try to read the first skb at the same time,

> > therefore potentially user-space could get duplicated data (one via

> > recvmsg(), one via sockmap). I don't see any benefits but races here.

>

> The benefit of _not_ dropping it is the packet gets to the receiver

> side. We've spent a bit of effort to get a packet across the network,

> received on the stack, and then we drop it at the last point is not

> so friendly.


Well, at least udp_recvmsg() could drop packets too in various
scenarios, for example, a copy error. So, I do not think sockmap
is special.

>

> About races is the socket is locked by the caller here? Or is this

> not the case for UDP.


Unlike TCP, the sock is not locked during BH for UDP receive path.
Locking it is not the answer here, because 1) we certainly do not want
to slow down UDP fast path; 2) UDP lacks sk->sk_backlog_rcv().

>

> Its OK in the end to say "its UDP and lossy" but ideally we don't

> make things worse by adding sockmap into the stack. We had these

> problems already on TCP side, where they are much more severe

> because sender believes retransmits will happen, and fixed them

> by now. It would be nice if UDP side also didn't introduce

> drops.


Like I said, the normal UDP receive path drops packets too,
sockmap is not different here. TCP does peek packets, for two
reasons: 1) it has to support splice(); 2) it has locked the socket
during BH receive. UDP has none of them, so UDP can't peek
packets here.

Thanks.
John Fastabend May 19, 2021, 9:54 p.m. UTC | #7
Cong Wang wrote:
> On Wed, May 19, 2021 at 12:06 PM John Fastabend

> <john.fastabend@gmail.com> wrote:

> >

> > Cong Wang wrote:

> > > On Tue, May 18, 2021 at 12:56 PM John Fastabend

> > > <john.fastabend@gmail.com> wrote:

> > > >

> > > > Cong Wang wrote:

> > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend

> > > > > <john.fastabend@gmail.com> wrote:

> > > > > >

> > > > > > Cong Wang wrote:

> > > > > > > From: Cong Wang <cong.wang@bytedance.com>

> > > > > > >

> > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone

> > > > > > > afterward, so udp_read_sock() should free the original skb after

> > > > > > > done using it.

> > > > > >

> > > > > > The clone only happens if sk_psock_verdict_recv() returns >0.

> > > > >

> > > > > Sure, in case of error, no one uses the original skb either,

> > > > > so still need to free it.

> > > >

> > > > But the data is going to be dropped then. I'm questioning if this

> > > > is the best we can do or not. Its simplest sure, but could we

> > > > do a bit more work and peek those skbs or requeue them? Otherwise

> > > > if you cross memory limits for a bit your likely to drop these

> > > > unnecessarily.

> > >

> > > What are the benefits of not dropping it? When sockmap takes

> > > over sk->sk_data_ready() it should have total control over the skb's

> > > in the receive queue. Otherwise user-space recvmsg() would race

> > > with sockmap when they try to read the first skb at the same time,

> > > therefore potentially user-space could get duplicated data (one via

> > > recvmsg(), one via sockmap). I don't see any benefits but races here.

> >

> > The benefit of _not_ dropping it is the packet gets to the receiver

> > side. We've spent a bit of effort to get a packet across the network,

> > received on the stack, and then we drop it at the last point is not

> > so friendly.

> 

> Well, at least udp_recvmsg() could drop packets too in various

> scenarios, for example, a copy error. So, I do not think sockmap

> is special.


OK I am at least convinced now that dropping packets is OK and likely
a useful performance/complexity compromise.

But, at this point we wont have any visibility into these drops correct?
Looks like the pattern in UDP stack to handle this is to increment
sk_drops and UDP_MIB_INERRORS. How about we do that here as well?
Cong Wang May 19, 2021, 11:26 p.m. UTC | #8
On Wed, May 19, 2021 at 2:54 PM John Fastabend <john.fastabend@gmail.com> wrote:
>

> Cong Wang wrote:

> > On Wed, May 19, 2021 at 12:06 PM John Fastabend

> > <john.fastabend@gmail.com> wrote:

> > >

> > > Cong Wang wrote:

> > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend

> > > > <john.fastabend@gmail.com> wrote:

> > > > >

> > > > > Cong Wang wrote:

> > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend

> > > > > > <john.fastabend@gmail.com> wrote:

> > > > > > >

> > > > > > > Cong Wang wrote:

> > > > > > > > From: Cong Wang <cong.wang@bytedance.com>

> > > > > > > >

> > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone

> > > > > > > > afterward, so udp_read_sock() should free the original skb after

> > > > > > > > done using it.

> > > > > > >

> > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0.

> > > > > >

> > > > > > Sure, in case of error, no one uses the original skb either,

> > > > > > so still need to free it.

> > > > >

> > > > > But the data is going to be dropped then. I'm questioning if this

> > > > > is the best we can do or not. Its simplest sure, but could we

> > > > > do a bit more work and peek those skbs or requeue them? Otherwise

> > > > > if you cross memory limits for a bit your likely to drop these

> > > > > unnecessarily.

> > > >

> > > > What are the benefits of not dropping it? When sockmap takes

> > > > over sk->sk_data_ready() it should have total control over the skb's

> > > > in the receive queue. Otherwise user-space recvmsg() would race

> > > > with sockmap when they try to read the first skb at the same time,

> > > > therefore potentially user-space could get duplicated data (one via

> > > > recvmsg(), one via sockmap). I don't see any benefits but races here.

> > >

> > > The benefit of _not_ dropping it is the packet gets to the receiver

> > > side. We've spent a bit of effort to get a packet across the network,

> > > received on the stack, and then we drop it at the last point is not

> > > so friendly.

> >

> > Well, at least udp_recvmsg() could drop packets too in various

> > scenarios, for example, a copy error. So, I do not think sockmap

> > is special.

>

> OK I am at least convinced now that dropping packets is OK and likely

> a useful performance/complexity compromise.

>

> But, at this point we wont have any visibility into these drops correct?

> Looks like the pattern in UDP stack to handle this is to increment

> sk_drops and UDP_MIB_INERRORS. How about we do that here as well?


We are not dropping the packet, the packet is cloned and deliver to
user-space via sk_psock_verdict_recv(), thus, we are simply leaking
the original skb, regardless of any error. Maybe udp_read_sock()
should check desc->error, but it has nothing to do with this path which
only aims to address a memory leak. A separate patch is need to check
desc->error, if really needed.

Thanks.
John Fastabend May 20, 2021, 5:42 p.m. UTC | #9
Cong Wang wrote:
> On Wed, May 19, 2021 at 2:54 PM John Fastabend <john.fastabend@gmail.com> wrote:

> >

> > Cong Wang wrote:

> > > On Wed, May 19, 2021 at 12:06 PM John Fastabend

> > > <john.fastabend@gmail.com> wrote:

> > > >

> > > > Cong Wang wrote:

> > > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend

> > > > > <john.fastabend@gmail.com> wrote:

> > > > > >

> > > > > > Cong Wang wrote:

> > > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend

> > > > > > > <john.fastabend@gmail.com> wrote:

> > > > > > > >

> > > > > > > > Cong Wang wrote:

> > > > > > > > > From: Cong Wang <cong.wang@bytedance.com>

> > > > > > > > >

> > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone

> > > > > > > > > afterward, so udp_read_sock() should free the original skb after

> > > > > > > > > done using it.

> > > > > > > >

> > > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0.

> > > > > > >

> > > > > > > Sure, in case of error, no one uses the original skb either,

> > > > > > > so still need to free it.

> > > > > >

> > > > > > But the data is going to be dropped then. I'm questioning if this

> > > > > > is the best we can do or not. Its simplest sure, but could we

> > > > > > do a bit more work and peek those skbs or requeue them? Otherwise

> > > > > > if you cross memory limits for a bit your likely to drop these

> > > > > > unnecessarily.

> > > > >

> > > > > What are the benefits of not dropping it? When sockmap takes

> > > > > over sk->sk_data_ready() it should have total control over the skb's

> > > > > in the receive queue. Otherwise user-space recvmsg() would race

> > > > > with sockmap when they try to read the first skb at the same time,

> > > > > therefore potentially user-space could get duplicated data (one via

> > > > > recvmsg(), one via sockmap). I don't see any benefits but races here.

> > > >

> > > > The benefit of _not_ dropping it is the packet gets to the receiver

> > > > side. We've spent a bit of effort to get a packet across the network,

> > > > received on the stack, and then we drop it at the last point is not

> > > > so friendly.

> > >

> > > Well, at least udp_recvmsg() could drop packets too in various

> > > scenarios, for example, a copy error. So, I do not think sockmap

> > > is special.

> >

> > OK I am at least convinced now that dropping packets is OK and likely

> > a useful performance/complexity compromise.

> >

> > But, at this point we wont have any visibility into these drops correct?

> > Looks like the pattern in UDP stack to handle this is to increment

> > sk_drops and UDP_MIB_INERRORS. How about we do that here as well?

> 

> We are not dropping the packet, the packet is cloned and deliver to

> user-space via sk_psock_verdict_recv(), thus, we are simply leaking

> the original skb, regardless of any error. Maybe udp_read_sock()

> should check desc->error, but it has nothing to do with this path which

> only aims to address a memory leak. A separate patch is need to check

> desc->error, if really needed.

> 

> Thanks.


"We are not dropping the packet" you'll need to be more explicit on
how this path is not dropping the skb,

  udp_read_sock()
    skb = skb_recv_udp()
     __skb_recv_udp() 
       __skb_try_recv_from_queue()
        __skb_unlink()              // skb is off the queue
    used = recv_actor()
      sk_psock_verdict_recv()
        return 0; 
    if (used <= 0) {
      kfree(skb)         // skb is unlink'd and kfree'd
      break;
    return 0 

So if in the error case the skb is unlink'd from the queue and
kfree'd where is it still existing, how do we get it back? It
sure looks dropped to me. Yes, the kfree() is needed to not
leak it, but I'm saying we don't want to drop packets silently.
The convention in UDP space looks to be inc sk->sk_drop and inc
the MIB. When we have to debug this on deployed systems and
packets silently get dropped its going to cause lots of pain so
lets be sure we get the counters correct.

.John
Cong Wang May 20, 2021, 8:14 p.m. UTC | #10
On Thu, May 20, 2021 at 10:43 AM John Fastabend
<john.fastabend@gmail.com> wrote:
>

> Cong Wang wrote:

> > On Wed, May 19, 2021 at 2:54 PM John Fastabend <john.fastabend@gmail.com> wrote:

> > >

> > > Cong Wang wrote:

> > > > On Wed, May 19, 2021 at 12:06 PM John Fastabend

> > > > <john.fastabend@gmail.com> wrote:

> > > > >

> > > > > Cong Wang wrote:

> > > > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend

> > > > > > <john.fastabend@gmail.com> wrote:

> > > > > > >

> > > > > > > Cong Wang wrote:

> > > > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend

> > > > > > > > <john.fastabend@gmail.com> wrote:

> > > > > > > > >

> > > > > > > > > Cong Wang wrote:

> > > > > > > > > > From: Cong Wang <cong.wang@bytedance.com>

> > > > > > > > > >

> > > > > > > > > > sk_psock_verdict_recv() clones the skb and uses the clone

> > > > > > > > > > afterward, so udp_read_sock() should free the original skb after

> > > > > > > > > > done using it.

> > > > > > > > >

> > > > > > > > > The clone only happens if sk_psock_verdict_recv() returns >0.

> > > > > > > >

> > > > > > > > Sure, in case of error, no one uses the original skb either,

> > > > > > > > so still need to free it.

> > > > > > >

> > > > > > > But the data is going to be dropped then. I'm questioning if this

> > > > > > > is the best we can do or not. Its simplest sure, but could we

> > > > > > > do a bit more work and peek those skbs or requeue them? Otherwise

> > > > > > > if you cross memory limits for a bit your likely to drop these

> > > > > > > unnecessarily.

> > > > > >

> > > > > > What are the benefits of not dropping it? When sockmap takes

> > > > > > over sk->sk_data_ready() it should have total control over the skb's

> > > > > > in the receive queue. Otherwise user-space recvmsg() would race

> > > > > > with sockmap when they try to read the first skb at the same time,

> > > > > > therefore potentially user-space could get duplicated data (one via

> > > > > > recvmsg(), one via sockmap). I don't see any benefits but races here.

> > > > >

> > > > > The benefit of _not_ dropping it is the packet gets to the receiver

> > > > > side. We've spent a bit of effort to get a packet across the network,

> > > > > received on the stack, and then we drop it at the last point is not

> > > > > so friendly.

> > > >

> > > > Well, at least udp_recvmsg() could drop packets too in various

> > > > scenarios, for example, a copy error. So, I do not think sockmap

> > > > is special.

> > >

> > > OK I am at least convinced now that dropping packets is OK and likely

> > > a useful performance/complexity compromise.

> > >

> > > But, at this point we wont have any visibility into these drops correct?

> > > Looks like the pattern in UDP stack to handle this is to increment

> > > sk_drops and UDP_MIB_INERRORS. How about we do that here as well?

> >

> > We are not dropping the packet, the packet is cloned and deliver to

> > user-space via sk_psock_verdict_recv(), thus, we are simply leaking

> > the original skb, regardless of any error. Maybe udp_read_sock()

> > should check desc->error, but it has nothing to do with this path which

> > only aims to address a memory leak. A separate patch is need to check

> > desc->error, if really needed.

> >

> > Thanks.

>

> "We are not dropping the packet" you'll need to be more explicit on

> how this path is not dropping the skb,


You know it is cloned, don't you? So if we clone an skb and deliver
the clone then free the original, what is dropped here? Absolutely
nothing.

By "drop", we clearly mean nothing is delivered. Or do you have
any different definition of "drop"?

>

>   udp_read_sock()

>     skb = skb_recv_udp()

>      __skb_recv_udp()

>        __skb_try_recv_from_queue()

>         __skb_unlink()              // skb is off the queue

>     used = recv_actor()

>       sk_psock_verdict_recv()


Why do you intentionally ignore the fact the skb is cloned
and the clone is delivered??

>         return 0;

>     if (used <= 0) {

>       kfree(skb)         // skb is unlink'd and kfree'd


Why do you ignore the other kfree_skb() I added in this patch?
Which is clearly on the non-error path. This is why I said the
skb needs to be freed _regardless_ of error or not. You just
keep ignoring it.


>       break;

>     return 0

>

> So if in the error case the skb is unlink'd from the queue and

> kfree'd where is it still existing, how do we get it back? It

> sure looks dropped to me. Yes, the kfree() is needed to not

> leak it, but I'm saying we don't want to drop packets silently.


See above, you clearly ignored the other kfree_skb() which is
on non-error path.

> The convention in UDP space looks to be inc sk->sk_drop and inc

> the MIB. When we have to debug this on deployed systems and

> packets silently get dropped its going to cause lots of pain so

> lets be sure we get the counters correct.


Sure, let me quote what I already said:
" A separate patch is need to check desc->error, if really needed."

This patch, as its subject tells, aims to address a memory leak, not
to address error counters. BTW, TCP does not increase error
counters either, yet another reason it deserves a separate patch
to address both.

Thanks.
John Fastabend May 21, 2021, 10:09 p.m. UTC | #11
Cong Wang wrote:
> On Thu, May 20, 2021 at 10:43 AM John Fastabend

> <john.fastabend@gmail.com> wrote:

> >

> > Cong Wang wrote:

> > > On Wed, May 19, 2021 at 2:54 PM John Fastabend <john.fastabend@gmail.com> wrote:

> > > >

> > > > Cong Wang wrote:

> > > > > On Wed, May 19, 2021 at 12:06 PM John Fastabend

> > > > > <john.fastabend@gmail.com> wrote:

> > > > > >

> > > > > > Cong Wang wrote:

> > > > > > > On Tue, May 18, 2021 at 12:56 PM John Fastabend

> > > > > > > <john.fastabend@gmail.com> wrote:

> > > > > > > >

> > > > > > > > Cong Wang wrote:

> > > > > > > > > On Mon, May 17, 2021 at 10:36 PM John Fastabend

> > > > > > > > > <john.fastabend@gmail.com> wrote:

> > > > > > > > > >

> > > > > > > > > > Cong Wang wrote:

> > > > > > > > > > > From: Cong Wang <cong.wang@bytedance.com>

> > > > > > > > > > >


[...]

> >

> > "We are not dropping the packet" you'll need to be more explicit on

> > how this path is not dropping the skb,

> 

> You know it is cloned, don't you? So if we clone an skb and deliver

> the clone then free the original, what is dropped here? Absolutely

> nothing.

> 

> By "drop", we clearly mean nothing is delivered. Or do you have

> any different definition of "drop"?


This is my meaning of 'drop'.

> 

> >

> >   udp_read_sock()

> >     skb = skb_recv_udp()

> >      __skb_recv_udp()

> >        __skb_try_recv_from_queue()

> >         __skb_unlink()              // skb is off the queue

> >     used = recv_actor()

> >       sk_psock_verdict_recv()

> 

> Why do you intentionally ignore the fact the skb is cloned

> and the clone is delivered??


I'm only concerned about the error case here.

> 

> >         return 0;

> >     if (used <= 0) {

> >       kfree(skb)         // skb is unlink'd and kfree'd

> 

> Why do you ignore the other kfree_skb() I added in this patch?

> Which is clearly on the non-error path. This is why I said the

> skb needs to be freed _regardless_ of error or not. You just

> keep ignoring it.


Agree it needs a kfree_skb in both cases I'm not ignoring it. My
issue with this fix is it is not complete. We need some counter
to increment the udp stats so we can learn about these drops.
And it looks like an extra two lines will get us that.

> 

> 

> >       break;

> >     return 0

> >

> > So if in the error case the skb is unlink'd from the queue and

> > kfree'd where is it still existing, how do we get it back? It

> > sure looks dropped to me. Yes, the kfree() is needed to not

> > leak it, but I'm saying we don't want to drop packets silently.

> 

> See above, you clearly ignored the other kfree_skb() which is

> on non-error path.


Ignored in this thread because its correct and looks good to me.
My only issue is with the error path.

> 

> > The convention in UDP space looks to be inc sk->sk_drop and inc

> > the MIB. When we have to debug this on deployed systems and

> > packets silently get dropped its going to cause lots of pain so

> > lets be sure we get the counters correct.

> 

> Sure, let me quote what I already said:

> " A separate patch is need to check desc->error, if really needed."


Check desc->error or even just used <= 0 is sufficient here. Yes it is
really needed I don't want to debug this in a prod environment
without it.

> 

> This patch, as its subject tells, aims to address a memory leak, not

> to address error counters. BTW, TCP does not increase error


OK either add the counters in this patch or as a series of two
patches so we get a complete fix in one series. Without it some
box out there will randomly drop UDP packets, probably DNS
packets for extra fun, and we will have no way of knowing other
than sporadic packet loss. Unless your arguing against having the
counters or that the counters don't make sense for some reason?

> counters either, yet another reason it deserves a separate patch

> to address both.


TCP case is different if we drop packets in a TCP error case
thats not a 'lets increment the counters' problem the drop needs
to be fixed we can't let data fall out of a TCP stream because
no one will retransmit it. We've learned this the hard way.

> 

> Thanks.
Cong Wang May 21, 2021, 11:39 p.m. UTC | #12
On Fri, May 21, 2021 at 3:09 PM John Fastabend <john.fastabend@gmail.com> wrote:
> OK either add the counters in this patch or as a series of two

> patches so we get a complete fix in one series. Without it some

> box out there will randomly drop UDP packets, probably DNS

> packets for extra fun, and we will have no way of knowing other

> than sporadic packet loss. Unless your arguing against having the

> counters or that the counters don't make sense for some reason?


I never object increasing any counter here, My argument is it
belongs to a separate patch for 3 reasons:

1) TCP does not have one either, hence needs to fix together;

2) A patch should fix one bug, not two or more bugs together;

3) It is not the only one place which needs to increase the
counter, all of these kfree_skb()'s need, for example, this one
inside sk_psock_verdict_recv():

        psock = sk_psock(sk);
        if (unlikely(!psock)) {
                len = 0;
                kfree_skb(skb);
                goto out;
        }

This example also shows it is harder to do so, because
sk_psock_verdict_recv() is independent of any protocol, it is
hard to increase a protocol-specific counter there.

(Another one is in sk_psock_verdict_apply().)

>

> > counters either, yet another reason it deserves a separate patch

> > to address both.

>

> TCP case is different if we drop packets in a TCP error case

> thats not a 'lets increment the counters' problem the drop needs

> to be fixed we can't let data fall out of a TCP stream because

> no one will retransmit it. We've learned this the hard way.


I think TCP always increases some counter when dropping
a packet despite of retransmission, for example:

static void tcp_drop(struct sock *sk, struct sk_buff *skb)
{
        sk_drops_add(sk, skb);
        __kfree_skb(skb);
}

Thanks.
diff mbox series

Patch

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 15f5504adf5b..e31d67fd5183 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1798,11 +1798,13 @@  int udp_read_sock(struct sock *sk, read_descriptor_t *desc,
 		if (used <= 0) {
 			if (!copied)
 				copied = used;
+			kfree_skb(skb);
 			break;
 		} else if (used <= skb->len) {
 			copied += used;
 		}
 
+		kfree_skb(skb);
 		if (!desc->count)
 			break;
 	}