[bpf-next,v2,0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP

Message ID	20210302023743.24123-1-xiyou.wangcong@gmail.com
Headers	show Return-Path: <netdev-owner@kernel.org> From: Cong Wang <xiyou.wangcong@gmail.com> To: netdev@vger.kernel.org Cc: bpf@vger.kernel.org, duanxiongchun@bytedance.com, wangdongdong.6@bytedance.com, jiang.wang@bytedance.com, Cong Wang <cong.wang@bytedance.com> Subject: [Patch bpf-next v2 0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP Date: Mon, 1 Mar 2021 18:37:34 -0800 Message-Id: <20210302023743.24123-1-xiyou.wangcong@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	sockmap: introduce BPF_SK_SKB_VERDICT and support UDP \| expand [bpf-next,v2,0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP [bpf-next,v2,1/9] sock_map: introduce BPF_SK_SKB_VERDICT [bpf-next,v2,2/9] sock: introduce sk_prot->update_proto() [bpf-next,v2,3/9] udp: implement ->sendmsg_locked() [bpf-next,v2,4/9] udp: implement ->read_sock() for sockmap [bpf-next,v2,5/9] udp: add ->read_sock() and ->sendmsg_locked() to ipv6 [bpf-next,v2,6/9] skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data() [bpf-next,v2,7/9] udp: implement udp_bpf_recvmsg() for sockmap [bpf-next,v2,8/9] sock_map: update sock type checks for UDP [bpf-next,v2,9/9] selftests/bpf: add a test case for udp sockmap

Cong Wang March 2, 2021, 2:37 a.m. UTC

From: Cong Wang <cong.wang@bytedance.com>

We have thousands of services connected to a daemon on every host
via AF_UNIX dgram sockets, after they are moved into VM, we have to
add a proxy to forward these communications from VM to host, because
rewriting thousands of them is not practical. This proxy uses an
AF_UNIX socket connected to services and a UDP socket to connect to
the host. It is inefficient because data is copied between kernel
space and user space twice, and we can not use splice() which only
supports TCP. Therefore, we want to use sockmap to do the splicing
without going to user-space at all (after the initial setup).

Currently sockmap only fully supports TCP, UDP is partially supported
as it is only allowed to add into sockmap. This patchset, as the second
part of the original large patchset, extends sockmap with:
1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support.

On the high level, ->sendmsg_locked() and ->read_sock() are required
for each protocol to support sockmap redirection, and in order to do
sock proto update, a new ops ->update_proto() is introduced, which is
also required to implement. A BPF ->recvmsg() is also needed to replace
the original ->recvmsg() to retrieve skmsg. Please see each patch for
more details.

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
this patchset is also available:
https://github.com/congwang/linux/tree/sockmap2

---
v2: separate from the original large patchset
    rebase to the latest bpf-next
    split UDP test case
    move inet_csk_has_ulp() check to tcp_bpf.c
    clean up udp_read_sock()

Cong Wang (9):
  sock_map: introduce BPF_SK_SKB_VERDICT
  sock: introduce sk_prot->update_proto()
  udp: implement ->sendmsg_locked()
  udp: implement ->read_sock() for sockmap
  udp: add ->read_sock() and ->sendmsg_locked() to ipv6
  skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  udp: implement udp_bpf_recvmsg() for sockmap
  sock_map: update sock type checks for UDP
  selftests/bpf: add a test case for udp sockmap

 include/linux/skmsg.h                         |  25 ++--
 include/net/ipv6.h                            |   1 +
 include/net/sock.h                            |   3 +
 include/net/tcp.h                             |   3 +-
 include/net/udp.h                             |   4 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/syscall.c                          |   1 +
 net/core/skmsg.c                              | 113 +++++++++++++-
 net/core/sock_map.c                           |  52 ++++---
 net/ipv4/af_inet.c                            |   2 +
 net/ipv4/tcp_bpf.c                            | 129 +++-------------
 net/ipv4/tcp_ipv4.c                           |   3 +
 net/ipv4/udp.c                                |  68 ++++++++-
 net/ipv4/udp_bpf.c                            |  78 +++++++++-
 net/ipv6/af_inet6.c                           |   2 +
 net/ipv6/tcp_ipv6.c                           |   3 +
 net/ipv6/udp.c                                |  30 +++-
 net/tls/tls_sw.c                              |   4 +-
 tools/bpf/bpftool/common.c                    |   1 +
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |   1 +
 .../selftests/bpf/prog_tests/sockmap_listen.c | 140 ++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 23 files changed, 517 insertions(+), 170 deletions(-)

Lorenz Bauer March 2, 2021, 4:22 p.m. UTC | #1

On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:

...

> @@ -350,25 +351,12 @@ static inline void sk_psock_cork_free(struct sk_psock *psock)
>         }
>  }
>
> -static inline void sk_psock_update_proto(struct sock *sk,
> -                                        struct sk_psock *psock,
> -                                        struct proto *ops)
> -{
> -       /* Pairs with lockless read in sk_clone_lock() */
> -       WRITE_ONCE(sk->sk_prot, ops);
> -}
> -
>  static inline void sk_psock_restore_proto(struct sock *sk,
>                                           struct sk_psock *psock)
>  {
>         sk->sk_prot->unhash = psock->saved_unhash;

Not related to your patch set, but why do an extra restore of
sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf
/ udp_bpf protos, so overwriting that seems wrong?

> -       if (inet_csk_has_ulp(sk)) {
> -               tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
> -       } else {
> -               sk->sk_write_space = psock->saved_write_space;
> -               /* Pairs with lockless read in sk_clone_lock() */
> -               WRITE_ONCE(sk->sk_prot, psock->sk_proto);
> -       }
> +       if (psock->saved_update_proto)
> +               psock->saved_update_proto(sk, true);
>  }
>
>  static inline void sk_psock_set_state(struct sk_psock *psock,
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 636810ddcd9b..0e8577c917e8 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1184,6 +1184,9 @@ struct proto {
>         void                    (*unhash)(struct sock *sk);
>         void                    (*rehash)(struct sock *sk);
>         int                     (*get_port)(struct sock *sk, unsigned short snum);
> +#ifdef CONFIG_BPF_SYSCALL
> +       int                     (*update_proto)(struct sock *sk, bool restore);

Kind of a nit, but this name suggests that the callback is a lot more
generic than it really is. The only thing you can use it for is to
prep the socket to be sockmap ready since we hardwire sockmap_unhash,
etc. It's also not at all clear that this only works if sk has an
sk_psock associated with it. Calling it without one would crash the
kernel since the update_proto functions don't check for !sk_psock.

Might as well call it install_sockmap_hooks or something and have a
valid sk_psock be passed in to the callback. Additionally, I'd prefer
if the function returned a struct proto * like it does at the moment.
That way we keep sk->sk_prot manipulation confined to the sockmap code
and don't have to copy paste it into every proto.

> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 3bddd9dd2da2..13d2af5bb81c 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)
>
>  static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
>  {
> -       struct proto *prot;
> -
> -       switch (sk->sk_type) {
> -       case SOCK_STREAM:
> -               prot = tcp_bpf_get_proto(sk, psock);
> -               break;
> -
> -       case SOCK_DGRAM:
> -               prot = udp_bpf_get_proto(sk, psock);
> -               break;
> -
> -       default:
> +       if (!sk->sk_prot->update_proto)
>                 return -EINVAL;
> -       }
> -
> -       if (IS_ERR(prot))
> -               return PTR_ERR(prot);
> -
> -       sk_psock_update_proto(sk, psock, prot);
> -       return 0;
> +       psock->saved_update_proto = sk->sk_prot->update_proto;
> +       return sk->sk_prot->update_proto(sk, false);

I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've
not been diligent about this so far, but I think it makes sense to be
careful in new code.

Cong Wang March 2, 2021, 6:23 p.m. UTC | #2

On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>

> On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:

>

> ...

> >  static inline void sk_psock_restore_proto(struct sock *sk,

> >                                           struct sk_psock *psock)

> >  {

> >         sk->sk_prot->unhash = psock->saved_unhash;

>

> Not related to your patch set, but why do an extra restore of

> sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf

> / udp_bpf protos, so overwriting that seems wrong?


Good catch. It seems you are right, but I need a double check. And
yes, it is completely unrelated to my patch, as the current code has
the same problem.

>

> > -       if (inet_csk_has_ulp(sk)) {

> > -               tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);

> > -       } else {

> > -               sk->sk_write_space = psock->saved_write_space;

> > -               /* Pairs with lockless read in sk_clone_lock() */

> > -               WRITE_ONCE(sk->sk_prot, psock->sk_proto);

> > -       }

> > +       if (psock->saved_update_proto)

> > +               psock->saved_update_proto(sk, true);

> >  }

> >

> >  static inline void sk_psock_set_state(struct sk_psock *psock,

> > diff --git a/include/net/sock.h b/include/net/sock.h

> > index 636810ddcd9b..0e8577c917e8 100644

> > --- a/include/net/sock.h

> > +++ b/include/net/sock.h

> > @@ -1184,6 +1184,9 @@ struct proto {

> >         void                    (*unhash)(struct sock *sk);

> >         void                    (*rehash)(struct sock *sk);

> >         int                     (*get_port)(struct sock *sk, unsigned short snum);

> > +#ifdef CONFIG_BPF_SYSCALL

> > +       int                     (*update_proto)(struct sock *sk, bool restore);

>

> Kind of a nit, but this name suggests that the callback is a lot more

> generic than it really is. The only thing you can use it for is to

> prep the socket to be sockmap ready since we hardwire sockmap_unhash,

> etc. It's also not at all clear that this only works if sk has an

> sk_psock associated with it. Calling it without one would crash the

> kernel since the update_proto functions don't check for !sk_psock.

>

> Might as well call it install_sockmap_hooks or something and have a

> valid sk_psock be passed in to the callback. Additionally, I'd prefer


For the name, sure, I am always open to better names. Not sure if
'install_sockmap_hooks' is a good name, I also want to express we
are overriding sk_prot. How about 'psock_update_sk_prot'?


> if the function returned a struct proto * like it does at the moment.

> That way we keep sk->sk_prot manipulation confined to the sockmap code

> and don't have to copy paste it into every proto.


Well, TCP seems too special to do this, as it could call tcp_update_ulp()
to update the proto.

>

> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c

> > index 3bddd9dd2da2..13d2af5bb81c 100644

> > --- a/net/core/sock_map.c

> > +++ b/net/core/sock_map.c

> > @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)

> >

> >  static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)

> >  {

> > -       struct proto *prot;

> > -

> > -       switch (sk->sk_type) {

> > -       case SOCK_STREAM:

> > -               prot = tcp_bpf_get_proto(sk, psock);

> > -               break;

> > -

> > -       case SOCK_DGRAM:

> > -               prot = udp_bpf_get_proto(sk, psock);

> > -               break;

> > -

> > -       default:

> > +       if (!sk->sk_prot->update_proto)

> >                 return -EINVAL;

> > -       }

> > -

> > -       if (IS_ERR(prot))

> > -               return PTR_ERR(prot);

> > -

> > -       sk_psock_update_proto(sk, psock, prot);

> > -       return 0;

> > +       psock->saved_update_proto = sk->sk_prot->update_proto;

> > +       return sk->sk_prot->update_proto(sk, false);

>

> I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've

> not been diligent about this so far, but I think it makes sense to be

> careful in new code.


Hmm, there are many places not using READ_ONCE/WRITE_ONCE,
for a quick example:

void sock_map_unhash(struct sock *sk)
{
        void (*saved_unhash)(struct sock *sk);
        struct sk_psock *psock;

        rcu_read_lock();
        psock = sk_psock(sk);
        if (unlikely(!psock)) {
                rcu_read_unlock();
                if (sk->sk_prot->unhash)
                        sk->sk_prot->unhash(sk);
                return;
        }

        saved_unhash = psock->saved_unhash;
        sock_map_remove_links(sk, psock);
        rcu_read_unlock();
        saved_unhash(sk);
}

Thanks.

Yonghong Song March 3, 2021, 6:37 a.m. UTC | #3

On 3/1/21 6:37 PM, Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
> 
> Now UDP supports sockmap and redirection, we can safely update
> the sock type checks for it accordingly.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>   net/core/sock_map.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 13d2af5bb81c..f7eee4b7b994 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
>   
>   static bool sock_map_redirect_allowed(const struct sock *sk)
>   {
> -	return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
> +	if (sk_is_tcp(sk))
> +		return sk->sk_state != TCP_LISTEN;
> +	else
> +		return sk->sk_state == TCP_ESTABLISHED;

Not a networking expert, a dump question. Here we tested
whether sk_is_tcp(sk) or not, if not we compare
sk->sk_state == TCP_ESTABLISHED, could this be
always false? Mostly I missed something, some comments
here will be good.

>   }
>   
>   static bool sock_map_sk_is_suitable(const struct sock *sk)
>

Lorenz Bauer March 3, 2021, 9:35 a.m. UTC | #4

On Tue, 2 Mar 2021 at 18:23, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > if the function returned a struct proto * like it does at the moment.
> > That way we keep sk->sk_prot manipulation confined to the sockmap code
> > and don't have to copy paste it into every proto.
>
> Well, TCP seems too special to do this, as it could call tcp_update_ulp()
> to update the proto.

I had a quick look, tcp_bpf_update_proto is the only caller of tcp_update_ulp,
which in turn is the only caller of icsk_ulp_ops->update, which in turn is only
implemented as tls_update in tls_main.c. Turns out that tls_update
has another one of these calls:

} else {
    /* Pairs with lockless read in sk_clone_lock(). */
    WRITE_ONCE(sk->sk_prot, p);
    sk->sk_write_space = write_space;
}

Maybe it looks familiar? :o) I think it would be a worthwhile change.

>
> >
> > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> > > index 3bddd9dd2da2..13d2af5bb81c 100644
> > > --- a/net/core/sock_map.c
> > > +++ b/net/core/sock_map.c
> > > @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw)
> > >
> > >  static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock)
> > >  {
> > > -       struct proto *prot;
> > > -
> > > -       switch (sk->sk_type) {
> > > -       case SOCK_STREAM:
> > > -               prot = tcp_bpf_get_proto(sk, psock);
> > > -               break;
> > > -
> > > -       case SOCK_DGRAM:
> > > -               prot = udp_bpf_get_proto(sk, psock);
> > > -               break;
> > > -
> > > -       default:
> > > +       if (!sk->sk_prot->update_proto)
> > >                 return -EINVAL;
> > > -       }
> > > -
> > > -       if (IS_ERR(prot))
> > > -               return PTR_ERR(prot);
> > > -
> > > -       sk_psock_update_proto(sk, psock, prot);
> > > -       return 0;
> > > +       psock->saved_update_proto = sk->sk_prot->update_proto;
> > > +       return sk->sk_prot->update_proto(sk, false);
> >
> > I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've
> > not been diligent about this so far, but I think it makes sense to be
> > careful in new code.
>
> Hmm, there are many places not using READ_ONCE/WRITE_ONCE,
> for a quick example:

I know! I'll defer to John and Jakub.

Cong Wang March 3, 2021, 6:02 p.m. UTC | #5

On Tue, Mar 2, 2021 at 10:37 PM Yonghong Song <yhs@fb.com> wrote:
>

>

>

> On 3/1/21 6:37 PM, Cong Wang wrote:

> > From: Cong Wang <cong.wang@bytedance.com>

> >

> > Now UDP supports sockmap and redirection, we can safely update

> > the sock type checks for it accordingly.

> >

> > Cc: John Fastabend <john.fastabend@gmail.com>

> > Cc: Daniel Borkmann <daniel@iogearbox.net>

> > Cc: Jakub Sitnicki <jakub@cloudflare.com>

> > Cc: Lorenz Bauer <lmb@cloudflare.com>

> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>

> > ---

> >   net/core/sock_map.c | 5 ++++-

> >   1 file changed, 4 insertions(+), 1 deletion(-)

> >

> > diff --git a/net/core/sock_map.c b/net/core/sock_map.c

> > index 13d2af5bb81c..f7eee4b7b994 100644

> > --- a/net/core/sock_map.c

> > +++ b/net/core/sock_map.c

> > @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)

> >

> >   static bool sock_map_redirect_allowed(const struct sock *sk)

> >   {

> > -     return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;

> > +     if (sk_is_tcp(sk))

> > +             return sk->sk_state != TCP_LISTEN;

> > +     else

> > +             return sk->sk_state == TCP_ESTABLISHED;

>

> Not a networking expert, a dump question. Here we tested

> whether sk_is_tcp(sk) or not, if not we compare

> sk->sk_state == TCP_ESTABLISHED, could this be

> always false? Mostly I missed something, some comments

> here will be good.


No, dgram sockets also use TCP_ESTABLISHED as a valid
state. I know its name looks confusing, but it is already widely
used in networking:

net/appletalk/ddp.c:    sk->sk_state = TCP_ESTABLISHED;
net/appletalk/ddp.c:            if (sk->sk_state != TCP_ESTABLISHED)
net/appletalk/ddp.c:            if (sk->sk_state != TCP_ESTABLISHED)
net/ax25/af_ax25.c:     sk->sk_state    = TCP_ESTABLISHED;
net/ax25/af_ax25.c:             case TCP_ESTABLISHED: /* connection
established */
net/ax25/af_ax25.c:     if (sk->sk_state == TCP_ESTABLISHED &&
sk->sk_type == SOCK_SEQPACKET) {
net/ax25/af_ax25.c:             sk->sk_state   = TCP_ESTABLISHED;
net/ax25/af_ax25.c:     if (sk->sk_state != TCP_ESTABLISHED && (flags
& O_NONBLOCK)) {
net/ax25/af_ax25.c:     if (sk->sk_state != TCP_ESTABLISHED) {
net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
net/ax25/af_ax25.c:     if (sk->sk_type == SOCK_SEQPACKET &&
sk->sk_state != TCP_ESTABLISHED) {
net/ax25/ax25_ds_in.c:                  ax25->sk->sk_state = TCP_ESTABLISHED;
net/ax25/ax25_in.c:             make->sk_state = TCP_ESTABLISHED;
net/ax25/ax25_std_in.c:                         ax25->sk->sk_state =
TCP_ESTABLISHED;
net/caif/caif_socket.c: CAIF_CONNECTED          = TCP_ESTABLISHED,
net/ceph/messenger.c:   case TCP_ESTABLISHED:
net/ceph/messenger.c:           dout("%s TCP_ESTABLISHED\n", __func__);
net/core/datagram.c:        !(sk->sk_state == TCP_ESTABLISHED ||
sk->sk_state == TCP_LISTEN))
...

Hence, I believe it is okay to use it as it is, otherwise we would need
to comment on every use of it. ;)

Thanks.

Cong Wang March 3, 2021, 6:20 p.m. UTC | #6

On Wed, Mar 3, 2021 at 1:35 AM Lorenz Bauer <lmb@cloudflare.com> wrote:
>
> On Tue, 2 Mar 2021 at 18:23, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > > if the function returned a struct proto * like it does at the moment.
> > > That way we keep sk->sk_prot manipulation confined to the sockmap code
> > > and don't have to copy paste it into every proto.
> >
> > Well, TCP seems too special to do this, as it could call tcp_update_ulp()
> > to update the proto.
>
> I had a quick look, tcp_bpf_update_proto is the only caller of tcp_update_ulp,
> which in turn is the only caller of icsk_ulp_ops->update, which in turn is only
> implemented as tls_update in tls_main.c. Turns out that tls_update
> has another one of these calls:
>
> } else {
>     /* Pairs with lockless read in sk_clone_lock(). */
>     WRITE_ONCE(sk->sk_prot, p);
>     sk->sk_write_space = write_space;
> }
>
> Maybe it looks familiar? :o) I think it would be a worthwhile change.

Yeah, I am not surprised we can change tcp_update_ulp() too, but
why should I bother kTLS when I do not have to? What you suggest
could at most save us a bit of code size, not a big gain. So, I'd keep
its return value as it is, unless you see any other benefits.

BTW, I will rename it to 'psock_update_sk_prot', please let me know
if you have any better names.

Thanks.

Yonghong Song March 3, 2021, 6:50 p.m. UTC | #7

On 3/3/21 10:02 AM, Cong Wang wrote:
> On Tue, Mar 2, 2021 at 10:37 PM Yonghong Song <yhs@fb.com> wrote:
>>
>>
>>
>> On 3/1/21 6:37 PM, Cong Wang wrote:
>>> From: Cong Wang <cong.wang@bytedance.com>
>>>
>>> Now UDP supports sockmap and redirection, we can safely update
>>> the sock type checks for it accordingly.
>>>
>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>>> Cc: Jakub Sitnicki <jakub@cloudflare.com>
>>> Cc: Lorenz Bauer <lmb@cloudflare.com>
>>> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
>>> ---
>>>    net/core/sock_map.c | 5 ++++-
>>>    1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>> index 13d2af5bb81c..f7eee4b7b994 100644
>>> --- a/net/core/sock_map.c
>>> +++ b/net/core/sock_map.c
>>> @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk)
>>>
>>>    static bool sock_map_redirect_allowed(const struct sock *sk)
>>>    {
>>> -     return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN;
>>> +     if (sk_is_tcp(sk))
>>> +             return sk->sk_state != TCP_LISTEN;
>>> +     else
>>> +             return sk->sk_state == TCP_ESTABLISHED;
>>
>> Not a networking expert, a dump question. Here we tested
>> whether sk_is_tcp(sk) or not, if not we compare
>> sk->sk_state == TCP_ESTABLISHED, could this be
>> always false? Mostly I missed something, some comments
>> here will be good.
> 
> No, dgram sockets also use TCP_ESTABLISHED as a valid
> state. I know its name looks confusing, but it is already widely
> used in networking:
> 
> net/appletalk/ddp.c:    sk->sk_state = TCP_ESTABLISHED;
> net/appletalk/ddp.c:            if (sk->sk_state != TCP_ESTABLISHED)
> net/appletalk/ddp.c:            if (sk->sk_state != TCP_ESTABLISHED)
> net/ax25/af_ax25.c:     sk->sk_state    = TCP_ESTABLISHED;
> net/ax25/af_ax25.c:             case TCP_ESTABLISHED: /* connection
> established */
> net/ax25/af_ax25.c:     if (sk->sk_state == TCP_ESTABLISHED &&
> sk->sk_type == SOCK_SEQPACKET) {
> net/ax25/af_ax25.c:             sk->sk_state   = TCP_ESTABLISHED;
> net/ax25/af_ax25.c:     if (sk->sk_state != TCP_ESTABLISHED && (flags
> & O_NONBLOCK)) {
> net/ax25/af_ax25.c:     if (sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/af_ax25.c:             if (sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/af_ax25.c:     if (sk->sk_type == SOCK_SEQPACKET &&
> sk->sk_state != TCP_ESTABLISHED) {
> net/ax25/ax25_ds_in.c:                  ax25->sk->sk_state = TCP_ESTABLISHED;
> net/ax25/ax25_in.c:             make->sk_state = TCP_ESTABLISHED;
> net/ax25/ax25_std_in.c:                         ax25->sk->sk_state =
> TCP_ESTABLISHED;
> net/caif/caif_socket.c: CAIF_CONNECTED          = TCP_ESTABLISHED,
> net/ceph/messenger.c:   case TCP_ESTABLISHED:
> net/ceph/messenger.c:           dout("%s TCP_ESTABLISHED\n", __func__);
> net/core/datagram.c:        !(sk->sk_state == TCP_ESTABLISHED ||
> sk->sk_state == TCP_LISTEN))
> ...
> 
> Hence, I believe it is okay to use it as it is, otherwise we would need
> to comment on every use of it. ;)

That is okay. Thanks for explanation!

> 
> Thanks.
>

Lorenz Bauer March 4, 2021, 9:30 a.m. UTC | #8

On Wed, 3 Mar 2021 at 18:21, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>

> Yeah, I am not surprised we can change tcp_update_ulp() too, but

> why should I bother kTLS when I do not have to? What you suggest

> could at most save us a bit of code size, not a big gain. So, I'd keep

> its return value as it is, unless you see any other benefits.


I think the end result is code that is easier to understand and
therefore maintain. Keep it as it is if you prefer.

> BTW, I will rename it to 'psock_update_sk_prot', please let me know

> if you have any better names.


SGTM.

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com

Cong Wang March 4, 2021, 11:52 p.m. UTC | #9

On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>

> On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:

> >

> > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:

> >

> > ...

> > >  static inline void sk_psock_restore_proto(struct sock *sk,

> > >                                           struct sk_psock *psock)

> > >  {

> > >         sk->sk_prot->unhash = psock->saved_unhash;

> >

> > Not related to your patch set, but why do an extra restore of

> > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf

> > / udp_bpf protos, so overwriting that seems wrong?

>

> Good catch. It seems you are right, but I need a double check. And

> yes, it is completely unrelated to my patch, as the current code has

> the same problem.


Looking at this again. I noticed

commit 4da6a196f93b1af7612340e8c1ad8ce71e18f955
Author: John Fastabend <john.fastabend@gmail.com>
Date:   Sat Jan 11 06:11:59 2020 +0000

    bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop

intentionally fixed a bug in kTLS with overwriting this ->unhash.

I agree with you that it should not be updated for sockmap case,
however I don't know what to do with kTLS case, it seems the bug the
above commit fixed still exists if we just revert it.

Anyway, this should be targeted for -bpf as a bug fix, so it does not
belong to this patchset.

Thanks.

John Fastabend March 6, 2021, 12:27 a.m. UTC | #10

Cong Wang wrote:
> On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> >

> > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:

> > >

> > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > >

> > > ...

> > > >  static inline void sk_psock_restore_proto(struct sock *sk,

> > > >                                           struct sk_psock *psock)

> > > >  {

> > > >         sk->sk_prot->unhash = psock->saved_unhash;

> > >

> > > Not related to your patch set, but why do an extra restore of

> > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf

> > > / udp_bpf protos, so overwriting that seems wrong?


"extra"? restore_proto should only be called when the psock ref count
is zero and we need to transition back to the original socks proto
handlers. To trigger this we can simply delete a sock from the map.
In the case where we are deleting the psock overwriting the tcp_bpf
protos is exactly what we want.?

> >

> > Good catch. It seems you are right, but I need a double check. And

> > yes, it is completely unrelated to my patch, as the current code has

> > the same problem.

> 

> Looking at this again. I noticed

> 

> commit 4da6a196f93b1af7612340e8c1ad8ce71e18f955

> Author: John Fastabend <john.fastabend@gmail.com>

> Date:   Sat Jan 11 06:11:59 2020 +0000

> 

>     bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop

> 

> intentionally fixed a bug in kTLS with overwriting this ->unhash.

> 

> I agree with you that it should not be updated for sockmap case,

> however I don't know what to do with kTLS case, it seems the bug the

> above commit fixed still exists if we just revert it.

> 

> Anyway, this should be targeted for -bpf as a bug fix, so it does not

> belong to this patchset.

> 

> Thanks.


Hi,

I'm missing the error case here. The restore logic happens when the refcnt
hits 0 on the psock, indicating its time to garbage collect the psock. 

 sk_psock_put
   if (refcount_dec_and_test(&psock->refcnt))
    sk_psock_drop(sk, psock);
      sk_psock_restore_proto(sk, psock)
         sk->sk_prot->unhash = psock->saved_unhash

When sockets are initialized via sk_psock_init() we opulate the unhash field

 psock->saved_unhash = prot->unhash;

So we need to unwind this otherwise a future unhash() call would not call
the original protos unhash handler.

Care to give me some more context on what the bug is?

Thanks,
John

Cong Wang March 6, 2021, 12:57 a.m. UTC | #11

On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote:
>

> Cong Wang wrote:

> > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > >

> > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:

> > > >

> > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > >

> > > > ...

> > > > >  static inline void sk_psock_restore_proto(struct sock *sk,

> > > > >                                           struct sk_psock *psock)

> > > > >  {

> > > > >         sk->sk_prot->unhash = psock->saved_unhash;

> > > >

> > > > Not related to your patch set, but why do an extra restore of

> > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf

> > > > / udp_bpf protos, so overwriting that seems wrong?

>

> "extra"? restore_proto should only be called when the psock ref count

> is zero and we need to transition back to the original socks proto

> handlers. To trigger this we can simply delete a sock from the map.

> In the case where we are deleting the psock overwriting the tcp_bpf

> protos is exactly what we want.?

Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting
tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not.
Because once you overwrite it, the next time you use it to replace
sk->sk_prot, it would be a different one rather than sock_map_unhash():

// tcp_bpf_prots->unhash == sock_map_unhash
sk_psock_restore_proto();
// Now  tcp_bpf_prots->unhash is inet_unhash
...
sk_psock_update_proto();
// sk->sk_proto is now tcp_bpf_prots again,
// so its ->unhash now is inet_unhash
// but it should be sock_map_unhash here

Thanks.

John Fastabend March 6, 2021, 1:55 a.m. UTC | #12

Cong Wang wrote:
> On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote:

> >

> > Cong Wang wrote:

> > > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > >

> > > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:

> > > > >

> > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > > >

> > > > > ...

> > > > > >  static inline void sk_psock_restore_proto(struct sock *sk,

> > > > > >                                           struct sk_psock *psock)

> > > > > >  {

> > > > > >         sk->sk_prot->unhash = psock->saved_unhash;

> > > > >

> > > > > Not related to your patch set, but why do an extra restore of

> > > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf

> > > > > / udp_bpf protos, so overwriting that seems wrong?

> >

> > "extra"? restore_proto should only be called when the psock ref count

> > is zero and we need to transition back to the original socks proto

> > handlers. To trigger this we can simply delete a sock from the map.

> > In the case where we are deleting the psock overwriting the tcp_bpf

> > protos is exactly what we want.?

> 

> Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting

> tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not.

> Because once you overwrite it, the next time you use it to replace

> sk->sk_prot, it would be a different one rather than sock_map_unhash():

> 

> // tcp_bpf_prots->unhash == sock_map_unhash

> sk_psock_restore_proto();

> // Now  tcp_bpf_prots->unhash is inet_unhash

> ...

> sk_psock_update_proto();

> // sk->sk_proto is now tcp_bpf_prots again,

> // so its ->unhash now is inet_unhash

> // but it should be sock_map_unhash here


Right, we can fix this on the TLS side. I'll push a fix shortly.

> 

> Thanks.

Cong Wang March 9, 2021, 5:53 p.m. UTC | #13

On Fri, Mar 5, 2021 at 5:55 PM John Fastabend <john.fastabend@gmail.com> wrote:
>

> Cong Wang wrote:

> > On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote:

> > >

> > > Cong Wang wrote:

> > > > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > > >

> > > > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote:

> > > > > >

> > > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote:

> > > > > >

> > > > > > ...

> > > > > > >  static inline void sk_psock_restore_proto(struct sock *sk,

> > > > > > >                                           struct sk_psock *psock)

> > > > > > >  {

> > > > > > >         sk->sk_prot->unhash = psock->saved_unhash;

> > > > > >

> > > > > > Not related to your patch set, but why do an extra restore of

> > > > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf

> > > > > > / udp_bpf protos, so overwriting that seems wrong?

> > >

> > > "extra"? restore_proto should only be called when the psock ref count

> > > is zero and we need to transition back to the original socks proto

> > > handlers. To trigger this we can simply delete a sock from the map.

> > > In the case where we are deleting the psock overwriting the tcp_bpf

> > > protos is exactly what we want.?

> >

> > Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting

> > tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not.

> > Because once you overwrite it, the next time you use it to replace

> > sk->sk_prot, it would be a different one rather than sock_map_unhash():

> >

> > // tcp_bpf_prots->unhash == sock_map_unhash

> > sk_psock_restore_proto();

> > // Now  tcp_bpf_prots->unhash is inet_unhash

> > ...

> > sk_psock_update_proto();

> > // sk->sk_proto is now tcp_bpf_prots again,

> > // so its ->unhash now is inet_unhash

> > // but it should be sock_map_unhash here

>

> Right, we can fix this on the TLS side. I'll push a fix shortly.


Are you still working on this? If kTLS still needs it, then we can
have something like this:

diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h
index 8edbbf5f2f93..5eb617df7f48 100644
--- a/include/linux/skmsg.h
+++ b/include/linux/skmsg.h
@@ -349,8 +349,8 @@ static inline void sk_psock_update_proto(struct sock *sk,
 static inline void sk_psock_restore_proto(struct sock *sk,
                                          struct sk_psock *psock)
 {
-       sk->sk_prot->unhash = psock->saved_unhash;
        if (inet_csk_has_ulp(sk)) {
+               sk->sk_prot->unhash = psock->saved_unhash;
                tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);
        } else {
                sk->sk_write_space = psock->saved_write_space;


Thanks.

John Fastabend March 10, 2021, 6:33 a.m. UTC | #14

Cong Wang wrote:
> On Fri, Mar 5, 2021 at 5:55 PM John Fastabend <john.fastabend@gmail.com> wrote:

> >


[...]

> > > // tcp_bpf_prots->unhash == sock_map_unhash

> > > sk_psock_restore_proto();

> > > // Now  tcp_bpf_prots->unhash is inet_unhash

> > > ...

> > > sk_psock_update_proto();

> > > // sk->sk_proto is now tcp_bpf_prots again,

> > > // so its ->unhash now is inet_unhash

> > > // but it should be sock_map_unhash here

> >

> > Right, we can fix this on the TLS side. I'll push a fix shortly.

> 

> Are you still working on this? If kTLS still needs it, then we can

> have something like this:


Testing a fix now I will flush it out tomorrow. The below is not
really correct either it just moves the issue so it only impacts
TLS.

> 

> diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h

> index 8edbbf5f2f93..5eb617df7f48 100644

> --- a/include/linux/skmsg.h

> +++ b/include/linux/skmsg.h

> @@ -349,8 +349,8 @@ static inline void sk_psock_update_proto(struct sock *sk,

>  static inline void sk_psock_restore_proto(struct sock *sk,

>                                           struct sk_psock *psock)

>  {

> -       sk->sk_prot->unhash = psock->saved_unhash;

>         if (inet_csk_has_ulp(sk)) {

> +               sk->sk_prot->unhash = psock->saved_unhash;

>                 tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space);

>         } else {

>                 sk->sk_write_space = psock->saved_write_space;

> 

> 

> Thanks.

[bpf-next,v2,0/9] sockmap: introduce BPF_SK_SKB_VERDICT and support UDP

Message

Comments