Message ID | 160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370 |
---|---|
Headers | show |
Series | sockmap fixes | expand |
On 11/7/20 8:37 PM, John Fastabend wrote: > If copy_page_to_iter() fails or even partially completes, but with fewer > bytes copied than expected we currently reset sg.start and return EFAULT. > This proves problematic if we already copied data into the user buffer > before we return an error. Because we leave the copied data in the user > buffer and fail to unwind the scatterlist so kernel side believes data > has been copied and user side believes data has _not_ been received. > > Expected behavior should be to return number of bytes copied and then > on the next read we need to return the error assuming its still there. This > can happen if we have a copy length spanning multiple scatterlist elements > and one or more complete before the error is hit. > > The error is rare enough though that my normal testing with server side > programs, such as nginx, httpd, envoy, etc., I have never seen this. The > only reliable way to reproduce that I've found is to stream movies over > my browser for a day or so and wait for it to hang. Not very scientific, > but with a few extra WARN_ON()s in the code the bug was obvious. > > When we review the errors from copy_page_to_iter() it seems we are hitting > a page fault from copy_page_to_iter_iovec() where the code checks > fault_in_pages_writeable(buf, copy) where buf is the user buffer. It > also seems typical server applications don't hit this case. > > The other way to try and reproduce this is run the sockmap selftest tool > test_sockmap with data verification enabled, but it doesn't reproduce the > fault. Perhaps we can trigger this case artificially somehow from the > test tools. I haven't sorted out a way to do that yet though. > > Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") > Signed-off-by: John Fastabend <john.fastabend@gmail.com> > --- > net/ipv4/tcp_bpf.c | 15 ++++++++++----- > 1 file changed, 10 insertions(+), 5 deletions(-) > > diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c > index 37f4cb2bba5c..3709d679436e 100644 > --- a/net/ipv4/tcp_bpf.c > +++ b/net/ipv4/tcp_bpf.c > @@ -15,8 +15,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, > { > struct iov_iter *iter = &msg->msg_iter; > int peek = flags & MSG_PEEK; > - int i, ret, copied = 0; > struct sk_msg *msg_rx; > + int i, copied = 0; > > msg_rx = list_first_entry_or_null(&psock->ingress_msg, > struct sk_msg, list); > @@ -37,10 +37,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, > page = sg_page(sge); > if (copied + copy > len) > copy = len - copied; > - ret = copy_page_to_iter(page, sge->offset, copy, iter); > - if (ret != copy) { > - msg_rx->sg.start = i; > - return -EFAULT; > + copy = copy_page_to_iter(page, sge->offset, copy, iter); > + if (!copy) { > + return copied ? copied : -EFAULT; > } nit: no need for {} > > copied += copy; > @@ -56,6 +55,11 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, > put_page(page); > } > } else { > + /* Lets not optimize peek case if copy_page_to_iter > + * didn't copy the entire length lets just break. > + */ > + if (copy != sge->length) > + goto out; nit: return copied; Rest lgtm for this one. > sk_msg_iter_var_next(i); > } > > @@ -82,6 +86,7 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, > struct sk_msg, list); > } > > +out: > return copied; > } > EXPORT_SYMBOL_GPL(__tcp_bpf_recvmsg); > >
On 11/7/20 8:38 PM, John Fastabend wrote: > If a socket redirects to itself and it is under memory pressure it is > possible to get a socket stuck so that recv() returns EAGAIN and the > socket can not advance for some time. This happens because when > redirecting a skb to the same socket we received the skb on we first > check if it is OK to enqueue the skb on the receiving socket by checking > memory limits. But, if the skb is itself the object holding the memory > needed to enqueue the skb we will keep retrying from kernel side > and always fail with EAGAIN. Then userspace will get a recv() EAGAIN > error if there are no skbs in the psock ingress queue. This will continue > until either some skbs get kfree'd causing the memory pressure to > reduce far enough that we can enqueue the pending packet or the > socket is destroyed. In some cases its possible to get a socket > stuck for a noticable amount of time if the socket is only receiving > skbs from sk_skb verdict programs. To reproduce I make the socket > memory limits ridiculously low so sockets are always under memory > pressure. More often though if under memory pressure it looks like > a spurious EAGAIN error on user space side causing userspace to retry > and typically enough has moved on the memory side that it works. > > To fix skip memory checks and skb_orphan if receiving on the same > sock as already assigned. > > For SK_PASS cases this is easy, its always the same socket so we > can just omit the orphan/set_owner pair. > > For backlog cases we need to check skb->sk and decide if the orphan > and set_owner pair are needed. > > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") > Signed-off-by: John Fastabend <john.fastabend@gmail.com> > --- > net/core/skmsg.c | 72 ++++++++++++++++++++++++++++++++++++++++-------------- > 1 file changed, 53 insertions(+), 19 deletions(-) > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > index fe44280c033e..580252e532da 100644 > --- a/net/core/skmsg.c > +++ b/net/core/skmsg.c > @@ -399,38 +399,38 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, > } > EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter); > > -static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) > +static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, > + struct sk_buff *skb) > { > - struct sock *sk = psock->sk; > - int copied = 0, num_sge; > struct sk_msg *msg; > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > - return -EAGAIN; > + return NULL; > + > + if (!sk_rmem_schedule(sk, skb, skb->len)) Isn't accounting always truesize based, thus we should fix & convert all skb->len to skb->truesize ? > + return NULL; > > msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC); > if (unlikely(!msg)) > - return -EAGAIN; > - if (!sk_rmem_schedule(sk, skb, skb->len)) { > - kfree(msg); > - return -EAGAIN; > - } > + return NULL; > > sk_msg_init(msg); > - num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len); > + return msg; > +} > +
Daniel Borkmann wrote: > On 11/7/20 8:38 PM, John Fastabend wrote: > > If a socket redirects to itself and it is under memory pressure it is > > possible to get a socket stuck so that recv() returns EAGAIN and the > > socket can not advance for some time. This happens because when > > redirecting a skb to the same socket we received the skb on we first > > check if it is OK to enqueue the skb on the receiving socket by checking > > memory limits. But, if the skb is itself the object holding the memory > > needed to enqueue the skb we will keep retrying from kernel side > > and always fail with EAGAIN. Then userspace will get a recv() EAGAIN > > error if there are no skbs in the psock ingress queue. This will continue > > until either some skbs get kfree'd causing the memory pressure to > > reduce far enough that we can enqueue the pending packet or the > > socket is destroyed. In some cases its possible to get a socket > > stuck for a noticable amount of time if the socket is only receiving > > skbs from sk_skb verdict programs. To reproduce I make the socket > > memory limits ridiculously low so sockets are always under memory > > pressure. More often though if under memory pressure it looks like > > a spurious EAGAIN error on user space side causing userspace to retry > > and typically enough has moved on the memory side that it works. > > > > To fix skip memory checks and skb_orphan if receiving on the same > > sock as already assigned. > > > > For SK_PASS cases this is easy, its always the same socket so we > > can just omit the orphan/set_owner pair. > > > > For backlog cases we need to check skb->sk and decide if the orphan > > and set_owner pair are needed. > > > > Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") > > Signed-off-by: John Fastabend <john.fastabend@gmail.com> > > --- > > net/core/skmsg.c | 72 ++++++++++++++++++++++++++++++++++++++++-------------- > > 1 file changed, 53 insertions(+), 19 deletions(-) > > > > diff --git a/net/core/skmsg.c b/net/core/skmsg.c > > index fe44280c033e..580252e532da 100644 > > --- a/net/core/skmsg.c > > +++ b/net/core/skmsg.c > > @@ -399,38 +399,38 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, > > } > > EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter); > > > > -static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) > > +static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, > > + struct sk_buff *skb) > > { > > - struct sock *sk = psock->sk; > > - int copied = 0, num_sge; > > struct sk_msg *msg; > > > > if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) > > - return -EAGAIN; > > + return NULL; > > + > > + if (!sk_rmem_schedule(sk, skb, skb->len)) > > Isn't accounting always truesize based, thus we should fix & convert all skb->len > to skb->truesize ? Right good catch, will fix in v2.
Daniel Borkmann wrote: > On 11/7/20 8:37 PM, John Fastabend wrote: > > If copy_page_to_iter() fails or even partially completes, but with fewer > > bytes copied than expected we currently reset sg.start and return EFAULT. > > This proves problematic if we already copied data into the user buffer > > before we return an error. Because we leave the copied data in the user > > buffer and fail to unwind the scatterlist so kernel side believes data > > has been copied and user side believes data has _not_ been received. [...] > > + if (!copy) { > > + return copied ? copied : -EFAULT; > > } > > nit: no need for {} > > > > > copied += copy; > > @@ -56,6 +55,11 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, > > put_page(page); > > } > > } else { > > + /* Lets not optimize peek case if copy_page_to_iter > > + * didn't copy the entire length lets just break. > > + */ > > + if (copy != sge->length) > > + goto out; > > nit: return copied; > > Rest lgtm for this one. Great, thanks for the review will fixup in v2.