Message ID | 20210302023743.24123-1-xiyou.wangcong@gmail.com |
---|---|
Headers | show |
Series | sockmap: introduce BPF_SK_SKB_VERDICT and support UDP | expand |
On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote: ... > @@ -350,25 +351,12 @@ static inline void sk_psock_cork_free(struct sk_psock *psock) > } > } > > -static inline void sk_psock_update_proto(struct sock *sk, > - struct sk_psock *psock, > - struct proto *ops) > -{ > - /* Pairs with lockless read in sk_clone_lock() */ > - WRITE_ONCE(sk->sk_prot, ops); > -} > - > static inline void sk_psock_restore_proto(struct sock *sk, > struct sk_psock *psock) > { > sk->sk_prot->unhash = psock->saved_unhash; Not related to your patch set, but why do an extra restore of sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf / udp_bpf protos, so overwriting that seems wrong? > - if (inet_csk_has_ulp(sk)) { > - tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space); > - } else { > - sk->sk_write_space = psock->saved_write_space; > - /* Pairs with lockless read in sk_clone_lock() */ > - WRITE_ONCE(sk->sk_prot, psock->sk_proto); > - } > + if (psock->saved_update_proto) > + psock->saved_update_proto(sk, true); > } > > static inline void sk_psock_set_state(struct sk_psock *psock, > diff --git a/include/net/sock.h b/include/net/sock.h > index 636810ddcd9b..0e8577c917e8 100644 > --- a/include/net/sock.h > +++ b/include/net/sock.h > @@ -1184,6 +1184,9 @@ struct proto { > void (*unhash)(struct sock *sk); > void (*rehash)(struct sock *sk); > int (*get_port)(struct sock *sk, unsigned short snum); > +#ifdef CONFIG_BPF_SYSCALL > + int (*update_proto)(struct sock *sk, bool restore); Kind of a nit, but this name suggests that the callback is a lot more generic than it really is. The only thing you can use it for is to prep the socket to be sockmap ready since we hardwire sockmap_unhash, etc. It's also not at all clear that this only works if sk has an sk_psock associated with it. Calling it without one would crash the kernel since the update_proto functions don't check for !sk_psock. Might as well call it install_sockmap_hooks or something and have a valid sk_psock be passed in to the callback. Additionally, I'd prefer if the function returned a struct proto * like it does at the moment. That way we keep sk->sk_prot manipulation confined to the sockmap code and don't have to copy paste it into every proto. > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index 3bddd9dd2da2..13d2af5bb81c 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw) > > static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock) > { > - struct proto *prot; > - > - switch (sk->sk_type) { > - case SOCK_STREAM: > - prot = tcp_bpf_get_proto(sk, psock); > - break; > - > - case SOCK_DGRAM: > - prot = udp_bpf_get_proto(sk, psock); > - break; > - > - default: > + if (!sk->sk_prot->update_proto) > return -EINVAL; > - } > - > - if (IS_ERR(prot)) > - return PTR_ERR(prot); > - > - sk_psock_update_proto(sk, psock, prot); > - return 0; > + psock->saved_update_proto = sk->sk_prot->update_proto; > + return sk->sk_prot->update_proto(sk, false); I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've not been diligent about this so far, but I think it makes sense to be careful in new code.
On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote: > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > ... > > static inline void sk_psock_restore_proto(struct sock *sk, > > struct sk_psock *psock) > > { > > sk->sk_prot->unhash = psock->saved_unhash; > > Not related to your patch set, but why do an extra restore of > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf > / udp_bpf protos, so overwriting that seems wrong? Good catch. It seems you are right, but I need a double check. And yes, it is completely unrelated to my patch, as the current code has the same problem. > > > - if (inet_csk_has_ulp(sk)) { > > - tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space); > > - } else { > > - sk->sk_write_space = psock->saved_write_space; > > - /* Pairs with lockless read in sk_clone_lock() */ > > - WRITE_ONCE(sk->sk_prot, psock->sk_proto); > > - } > > + if (psock->saved_update_proto) > > + psock->saved_update_proto(sk, true); > > } > > > > static inline void sk_psock_set_state(struct sk_psock *psock, > > diff --git a/include/net/sock.h b/include/net/sock.h > > index 636810ddcd9b..0e8577c917e8 100644 > > --- a/include/net/sock.h > > +++ b/include/net/sock.h > > @@ -1184,6 +1184,9 @@ struct proto { > > void (*unhash)(struct sock *sk); > > void (*rehash)(struct sock *sk); > > int (*get_port)(struct sock *sk, unsigned short snum); > > +#ifdef CONFIG_BPF_SYSCALL > > + int (*update_proto)(struct sock *sk, bool restore); > > Kind of a nit, but this name suggests that the callback is a lot more > generic than it really is. The only thing you can use it for is to > prep the socket to be sockmap ready since we hardwire sockmap_unhash, > etc. It's also not at all clear that this only works if sk has an > sk_psock associated with it. Calling it without one would crash the > kernel since the update_proto functions don't check for !sk_psock. > > Might as well call it install_sockmap_hooks or something and have a > valid sk_psock be passed in to the callback. Additionally, I'd prefer For the name, sure, I am always open to better names. Not sure if 'install_sockmap_hooks' is a good name, I also want to express we are overriding sk_prot. How about 'psock_update_sk_prot'? > if the function returned a struct proto * like it does at the moment. > That way we keep sk->sk_prot manipulation confined to the sockmap code > and don't have to copy paste it into every proto. Well, TCP seems too special to do this, as it could call tcp_update_ulp() to update the proto. > > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > > index 3bddd9dd2da2..13d2af5bb81c 100644 > > --- a/net/core/sock_map.c > > +++ b/net/core/sock_map.c > > @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw) > > > > static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock) > > { > > - struct proto *prot; > > - > > - switch (sk->sk_type) { > > - case SOCK_STREAM: > > - prot = tcp_bpf_get_proto(sk, psock); > > - break; > > - > > - case SOCK_DGRAM: > > - prot = udp_bpf_get_proto(sk, psock); > > - break; > > - > > - default: > > + if (!sk->sk_prot->update_proto) > > return -EINVAL; > > - } > > - > > - if (IS_ERR(prot)) > > - return PTR_ERR(prot); > > - > > - sk_psock_update_proto(sk, psock, prot); > > - return 0; > > + psock->saved_update_proto = sk->sk_prot->update_proto; > > + return sk->sk_prot->update_proto(sk, false); > > I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've > not been diligent about this so far, but I think it makes sense to be > careful in new code. Hmm, there are many places not using READ_ONCE/WRITE_ONCE, for a quick example: void sock_map_unhash(struct sock *sk) { void (*saved_unhash)(struct sock *sk); struct sk_psock *psock; rcu_read_lock(); psock = sk_psock(sk); if (unlikely(!psock)) { rcu_read_unlock(); if (sk->sk_prot->unhash) sk->sk_prot->unhash(sk); return; } saved_unhash = psock->saved_unhash; sock_map_remove_links(sk, psock); rcu_read_unlock(); saved_unhash(sk); } Thanks.
On 3/1/21 6:37 PM, Cong Wang wrote: > From: Cong Wang <cong.wang@bytedance.com> > > Now UDP supports sockmap and redirection, we can safely update > the sock type checks for it accordingly. > > Cc: John Fastabend <john.fastabend@gmail.com> > Cc: Daniel Borkmann <daniel@iogearbox.net> > Cc: Jakub Sitnicki <jakub@cloudflare.com> > Cc: Lorenz Bauer <lmb@cloudflare.com> > Signed-off-by: Cong Wang <cong.wang@bytedance.com> > --- > net/core/sock_map.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > index 13d2af5bb81c..f7eee4b7b994 100644 > --- a/net/core/sock_map.c > +++ b/net/core/sock_map.c > @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk) > > static bool sock_map_redirect_allowed(const struct sock *sk) > { > - return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN; > + if (sk_is_tcp(sk)) > + return sk->sk_state != TCP_LISTEN; > + else > + return sk->sk_state == TCP_ESTABLISHED; Not a networking expert, a dump question. Here we tested whether sk_is_tcp(sk) or not, if not we compare sk->sk_state == TCP_ESTABLISHED, could this be always false? Mostly I missed something, some comments here will be good. > } > > static bool sock_map_sk_is_suitable(const struct sock *sk) >
On Tue, 2 Mar 2021 at 18:23, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > if the function returned a struct proto * like it does at the moment. > > That way we keep sk->sk_prot manipulation confined to the sockmap code > > and don't have to copy paste it into every proto. > > Well, TCP seems too special to do this, as it could call tcp_update_ulp() > to update the proto. I had a quick look, tcp_bpf_update_proto is the only caller of tcp_update_ulp, which in turn is the only caller of icsk_ulp_ops->update, which in turn is only implemented as tls_update in tls_main.c. Turns out that tls_update has another one of these calls: } else { /* Pairs with lockless read in sk_clone_lock(). */ WRITE_ONCE(sk->sk_prot, p); sk->sk_write_space = write_space; } Maybe it looks familiar? :o) I think it would be a worthwhile change. > > > > > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > > > index 3bddd9dd2da2..13d2af5bb81c 100644 > > > --- a/net/core/sock_map.c > > > +++ b/net/core/sock_map.c > > > @@ -184,26 +184,10 @@ static void sock_map_unref(struct sock *sk, void *link_raw) > > > > > > static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock) > > > { > > > - struct proto *prot; > > > - > > > - switch (sk->sk_type) { > > > - case SOCK_STREAM: > > > - prot = tcp_bpf_get_proto(sk, psock); > > > - break; > > > - > > > - case SOCK_DGRAM: > > > - prot = udp_bpf_get_proto(sk, psock); > > > - break; > > > - > > > - default: > > > + if (!sk->sk_prot->update_proto) > > > return -EINVAL; > > > - } > > > - > > > - if (IS_ERR(prot)) > > > - return PTR_ERR(prot); > > > - > > > - sk_psock_update_proto(sk, psock, prot); > > > - return 0; > > > + psock->saved_update_proto = sk->sk_prot->update_proto; > > > + return sk->sk_prot->update_proto(sk, false); > > > > I think reads / writes from sk_prot need READ_ONCE / WRITE_ONCE. We've > > not been diligent about this so far, but I think it makes sense to be > > careful in new code. > > Hmm, there are many places not using READ_ONCE/WRITE_ONCE, > for a quick example: I know! I'll defer to John and Jakub.
On Tue, Mar 2, 2021 at 10:37 PM Yonghong Song <yhs@fb.com> wrote: > > > > On 3/1/21 6:37 PM, Cong Wang wrote: > > From: Cong Wang <cong.wang@bytedance.com> > > > > Now UDP supports sockmap and redirection, we can safely update > > the sock type checks for it accordingly. > > > > Cc: John Fastabend <john.fastabend@gmail.com> > > Cc: Daniel Borkmann <daniel@iogearbox.net> > > Cc: Jakub Sitnicki <jakub@cloudflare.com> > > Cc: Lorenz Bauer <lmb@cloudflare.com> > > Signed-off-by: Cong Wang <cong.wang@bytedance.com> > > --- > > net/core/sock_map.c | 5 ++++- > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/net/core/sock_map.c b/net/core/sock_map.c > > index 13d2af5bb81c..f7eee4b7b994 100644 > > --- a/net/core/sock_map.c > > +++ b/net/core/sock_map.c > > @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk) > > > > static bool sock_map_redirect_allowed(const struct sock *sk) > > { > > - return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN; > > + if (sk_is_tcp(sk)) > > + return sk->sk_state != TCP_LISTEN; > > + else > > + return sk->sk_state == TCP_ESTABLISHED; > > Not a networking expert, a dump question. Here we tested > whether sk_is_tcp(sk) or not, if not we compare > sk->sk_state == TCP_ESTABLISHED, could this be > always false? Mostly I missed something, some comments > here will be good. No, dgram sockets also use TCP_ESTABLISHED as a valid state. I know its name looks confusing, but it is already widely used in networking: net/appletalk/ddp.c: sk->sk_state = TCP_ESTABLISHED; net/appletalk/ddp.c: if (sk->sk_state != TCP_ESTABLISHED) net/appletalk/ddp.c: if (sk->sk_state != TCP_ESTABLISHED) net/ax25/af_ax25.c: sk->sk_state = TCP_ESTABLISHED; net/ax25/af_ax25.c: case TCP_ESTABLISHED: /* connection established */ net/ax25/af_ax25.c: if (sk->sk_state == TCP_ESTABLISHED && sk->sk_type == SOCK_SEQPACKET) { net/ax25/af_ax25.c: sk->sk_state = TCP_ESTABLISHED; net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED && (flags & O_NONBLOCK)) { net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED) { net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED) { net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED) { net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED) { net/ax25/af_ax25.c: if (sk->sk_type == SOCK_SEQPACKET && sk->sk_state != TCP_ESTABLISHED) { net/ax25/ax25_ds_in.c: ax25->sk->sk_state = TCP_ESTABLISHED; net/ax25/ax25_in.c: make->sk_state = TCP_ESTABLISHED; net/ax25/ax25_std_in.c: ax25->sk->sk_state = TCP_ESTABLISHED; net/caif/caif_socket.c: CAIF_CONNECTED = TCP_ESTABLISHED, net/ceph/messenger.c: case TCP_ESTABLISHED: net/ceph/messenger.c: dout("%s TCP_ESTABLISHED\n", __func__); net/core/datagram.c: !(sk->sk_state == TCP_ESTABLISHED || sk->sk_state == TCP_LISTEN)) ... Hence, I believe it is okay to use it as it is, otherwise we would need to comment on every use of it. ;) Thanks.
On Wed, Mar 3, 2021 at 1:35 AM Lorenz Bauer <lmb@cloudflare.com> wrote: > > On Tue, 2 Mar 2021 at 18:23, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > if the function returned a struct proto * like it does at the moment. > > > That way we keep sk->sk_prot manipulation confined to the sockmap code > > > and don't have to copy paste it into every proto. > > > > Well, TCP seems too special to do this, as it could call tcp_update_ulp() > > to update the proto. > > I had a quick look, tcp_bpf_update_proto is the only caller of tcp_update_ulp, > which in turn is the only caller of icsk_ulp_ops->update, which in turn is only > implemented as tls_update in tls_main.c. Turns out that tls_update > has another one of these calls: > > } else { > /* Pairs with lockless read in sk_clone_lock(). */ > WRITE_ONCE(sk->sk_prot, p); > sk->sk_write_space = write_space; > } > > Maybe it looks familiar? :o) I think it would be a worthwhile change. Yeah, I am not surprised we can change tcp_update_ulp() too, but why should I bother kTLS when I do not have to? What you suggest could at most save us a bit of code size, not a big gain. So, I'd keep its return value as it is, unless you see any other benefits. BTW, I will rename it to 'psock_update_sk_prot', please let me know if you have any better names. Thanks.
On 3/3/21 10:02 AM, Cong Wang wrote: > On Tue, Mar 2, 2021 at 10:37 PM Yonghong Song <yhs@fb.com> wrote: >> >> >> >> On 3/1/21 6:37 PM, Cong Wang wrote: >>> From: Cong Wang <cong.wang@bytedance.com> >>> >>> Now UDP supports sockmap and redirection, we can safely update >>> the sock type checks for it accordingly. >>> >>> Cc: John Fastabend <john.fastabend@gmail.com> >>> Cc: Daniel Borkmann <daniel@iogearbox.net> >>> Cc: Jakub Sitnicki <jakub@cloudflare.com> >>> Cc: Lorenz Bauer <lmb@cloudflare.com> >>> Signed-off-by: Cong Wang <cong.wang@bytedance.com> >>> --- >>> net/core/sock_map.c | 5 ++++- >>> 1 file changed, 4 insertions(+), 1 deletion(-) >>> >>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c >>> index 13d2af5bb81c..f7eee4b7b994 100644 >>> --- a/net/core/sock_map.c >>> +++ b/net/core/sock_map.c >>> @@ -549,7 +549,10 @@ static bool sk_is_udp(const struct sock *sk) >>> >>> static bool sock_map_redirect_allowed(const struct sock *sk) >>> { >>> - return sk_is_tcp(sk) && sk->sk_state != TCP_LISTEN; >>> + if (sk_is_tcp(sk)) >>> + return sk->sk_state != TCP_LISTEN; >>> + else >>> + return sk->sk_state == TCP_ESTABLISHED; >> >> Not a networking expert, a dump question. Here we tested >> whether sk_is_tcp(sk) or not, if not we compare >> sk->sk_state == TCP_ESTABLISHED, could this be >> always false? Mostly I missed something, some comments >> here will be good. > > No, dgram sockets also use TCP_ESTABLISHED as a valid > state. I know its name looks confusing, but it is already widely > used in networking: > > net/appletalk/ddp.c: sk->sk_state = TCP_ESTABLISHED; > net/appletalk/ddp.c: if (sk->sk_state != TCP_ESTABLISHED) > net/appletalk/ddp.c: if (sk->sk_state != TCP_ESTABLISHED) > net/ax25/af_ax25.c: sk->sk_state = TCP_ESTABLISHED; > net/ax25/af_ax25.c: case TCP_ESTABLISHED: /* connection > established */ > net/ax25/af_ax25.c: if (sk->sk_state == TCP_ESTABLISHED && > sk->sk_type == SOCK_SEQPACKET) { > net/ax25/af_ax25.c: sk->sk_state = TCP_ESTABLISHED; > net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED && (flags > & O_NONBLOCK)) { > net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED) { > net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED) { > net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED) { > net/ax25/af_ax25.c: if (sk->sk_state != TCP_ESTABLISHED) { > net/ax25/af_ax25.c: if (sk->sk_type == SOCK_SEQPACKET && > sk->sk_state != TCP_ESTABLISHED) { > net/ax25/ax25_ds_in.c: ax25->sk->sk_state = TCP_ESTABLISHED; > net/ax25/ax25_in.c: make->sk_state = TCP_ESTABLISHED; > net/ax25/ax25_std_in.c: ax25->sk->sk_state = > TCP_ESTABLISHED; > net/caif/caif_socket.c: CAIF_CONNECTED = TCP_ESTABLISHED, > net/ceph/messenger.c: case TCP_ESTABLISHED: > net/ceph/messenger.c: dout("%s TCP_ESTABLISHED\n", __func__); > net/core/datagram.c: !(sk->sk_state == TCP_ESTABLISHED || > sk->sk_state == TCP_LISTEN)) > ... > > Hence, I believe it is okay to use it as it is, otherwise we would need > to comment on every use of it. ;) That is okay. Thanks for explanation! > > Thanks. >
On Wed, 3 Mar 2021 at 18:21, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > Yeah, I am not surprised we can change tcp_update_ulp() too, but > why should I bother kTLS when I do not have to? What you suggest > could at most save us a bit of code size, not a big gain. So, I'd keep > its return value as it is, unless you see any other benefits. I think the end result is code that is easier to understand and therefore maintain. Keep it as it is if you prefer. > BTW, I will rename it to 'psock_update_sk_prot', please let me know > if you have any better names. SGTM. -- Lorenz Bauer | Systems Engineer 6th Floor, County Hall/The Riverside Building, SE1 7PB, UK www.cloudflare.com
On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote: > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > ... > > > static inline void sk_psock_restore_proto(struct sock *sk, > > > struct sk_psock *psock) > > > { > > > sk->sk_prot->unhash = psock->saved_unhash; > > > > Not related to your patch set, but why do an extra restore of > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf > > / udp_bpf protos, so overwriting that seems wrong? > > Good catch. It seems you are right, but I need a double check. And > yes, it is completely unrelated to my patch, as the current code has > the same problem. Looking at this again. I noticed commit 4da6a196f93b1af7612340e8c1ad8ce71e18f955 Author: John Fastabend <john.fastabend@gmail.com> Date: Sat Jan 11 06:11:59 2020 +0000 bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop intentionally fixed a bug in kTLS with overwriting this ->unhash. I agree with you that it should not be updated for sockmap case, however I don't know what to do with kTLS case, it seems the bug the above commit fixed still exists if we just revert it. Anyway, this should be targeted for -bpf as a bug fix, so it does not belong to this patchset. Thanks.
Cong Wang wrote: > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote: > > > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > ... > > > > static inline void sk_psock_restore_proto(struct sock *sk, > > > > struct sk_psock *psock) > > > > { > > > > sk->sk_prot->unhash = psock->saved_unhash; > > > > > > Not related to your patch set, but why do an extra restore of > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf > > > / udp_bpf protos, so overwriting that seems wrong? "extra"? restore_proto should only be called when the psock ref count is zero and we need to transition back to the original socks proto handlers. To trigger this we can simply delete a sock from the map. In the case where we are deleting the psock overwriting the tcp_bpf protos is exactly what we want.? > > > > Good catch. It seems you are right, but I need a double check. And > > yes, it is completely unrelated to my patch, as the current code has > > the same problem. > > Looking at this again. I noticed > > commit 4da6a196f93b1af7612340e8c1ad8ce71e18f955 > Author: John Fastabend <john.fastabend@gmail.com> > Date: Sat Jan 11 06:11:59 2020 +0000 > > bpf: Sockmap/tls, during free we may call tcp_bpf_unhash() in loop > > intentionally fixed a bug in kTLS with overwriting this ->unhash. > > I agree with you that it should not be updated for sockmap case, > however I don't know what to do with kTLS case, it seems the bug the > above commit fixed still exists if we just revert it. > > Anyway, this should be targeted for -bpf as a bug fix, so it does not > belong to this patchset. > > Thanks. Hi, I'm missing the error case here. The restore logic happens when the refcnt hits 0 on the psock, indicating its time to garbage collect the psock. sk_psock_put if (refcount_dec_and_test(&psock->refcnt)) sk_psock_drop(sk, psock); sk_psock_restore_proto(sk, psock) sk->sk_prot->unhash = psock->saved_unhash When sockets are initialized via sk_psock_init() we opulate the unhash field psock->saved_unhash = prot->unhash; So we need to unwind this otherwise a future unhash() call would not call the original protos unhash handler. Care to give me some more context on what the bug is? Thanks, John
On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Cong Wang wrote: > > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote: > > > > > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > > > ... > > > > > static inline void sk_psock_restore_proto(struct sock *sk, > > > > > struct sk_psock *psock) > > > > > { > > > > > sk->sk_prot->unhash = psock->saved_unhash; > > > > > > > > Not related to your patch set, but why do an extra restore of > > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf > > > > / udp_bpf protos, so overwriting that seems wrong? > > "extra"? restore_proto should only be called when the psock ref count > is zero and we need to transition back to the original socks proto > handlers. To trigger this we can simply delete a sock from the map. > In the case where we are deleting the psock overwriting the tcp_bpf > protos is exactly what we want.? Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not. Because once you overwrite it, the next time you use it to replace sk->sk_prot, it would be a different one rather than sock_map_unhash(): // tcp_bpf_prots->unhash == sock_map_unhash sk_psock_restore_proto(); // Now tcp_bpf_prots->unhash is inet_unhash ... sk_psock_update_proto(); // sk->sk_proto is now tcp_bpf_prots again, // so its ->unhash now is inet_unhash // but it should be sock_map_unhash here Thanks.
Cong Wang wrote: > On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > Cong Wang wrote: > > > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote: > > > > > > > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > > > > > ... > > > > > > static inline void sk_psock_restore_proto(struct sock *sk, > > > > > > struct sk_psock *psock) > > > > > > { > > > > > > sk->sk_prot->unhash = psock->saved_unhash; > > > > > > > > > > Not related to your patch set, but why do an extra restore of > > > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf > > > > > / udp_bpf protos, so overwriting that seems wrong? > > > > "extra"? restore_proto should only be called when the psock ref count > > is zero and we need to transition back to the original socks proto > > handlers. To trigger this we can simply delete a sock from the map. > > In the case where we are deleting the psock overwriting the tcp_bpf > > protos is exactly what we want.? > > Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting > tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not. > Because once you overwrite it, the next time you use it to replace > sk->sk_prot, it would be a different one rather than sock_map_unhash(): > > // tcp_bpf_prots->unhash == sock_map_unhash > sk_psock_restore_proto(); > // Now tcp_bpf_prots->unhash is inet_unhash > ... > sk_psock_update_proto(); > // sk->sk_proto is now tcp_bpf_prots again, > // so its ->unhash now is inet_unhash > // but it should be sock_map_unhash here Right, we can fix this on the TLS side. I'll push a fix shortly. > > Thanks.
On Fri, Mar 5, 2021 at 5:55 PM John Fastabend <john.fastabend@gmail.com> wrote: > > Cong Wang wrote: > > On Fri, Mar 5, 2021 at 4:27 PM John Fastabend <john.fastabend@gmail.com> wrote: > > > > > > Cong Wang wrote: > > > > On Tue, Mar 2, 2021 at 10:23 AM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > > > > > On Tue, Mar 2, 2021 at 8:22 AM Lorenz Bauer <lmb@cloudflare.com> wrote: > > > > > > > > > > > > On Tue, 2 Mar 2021 at 02:37, Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > > > > > > > > > ... > > > > > > > static inline void sk_psock_restore_proto(struct sock *sk, > > > > > > > struct sk_psock *psock) > > > > > > > { > > > > > > > sk->sk_prot->unhash = psock->saved_unhash; > > > > > > > > > > > > Not related to your patch set, but why do an extra restore of > > > > > > sk_prot->unhash here? At this point sk->sk_prot is one of our tcp_bpf > > > > > > / udp_bpf protos, so overwriting that seems wrong? > > > > > > "extra"? restore_proto should only be called when the psock ref count > > > is zero and we need to transition back to the original socks proto > > > handlers. To trigger this we can simply delete a sock from the map. > > > In the case where we are deleting the psock overwriting the tcp_bpf > > > protos is exactly what we want.? > > > > Why do you want to overwrite tcp_bpf_prots->unhash? Overwriting > > tcp_bpf_prots is correct, but overwriting tcp_bpf_prots->unhash is not. > > Because once you overwrite it, the next time you use it to replace > > sk->sk_prot, it would be a different one rather than sock_map_unhash(): > > > > // tcp_bpf_prots->unhash == sock_map_unhash > > sk_psock_restore_proto(); > > // Now tcp_bpf_prots->unhash is inet_unhash > > ... > > sk_psock_update_proto(); > > // sk->sk_proto is now tcp_bpf_prots again, > > // so its ->unhash now is inet_unhash > > // but it should be sock_map_unhash here > > Right, we can fix this on the TLS side. I'll push a fix shortly. Are you still working on this? If kTLS still needs it, then we can have something like this: diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index 8edbbf5f2f93..5eb617df7f48 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -349,8 +349,8 @@ static inline void sk_psock_update_proto(struct sock *sk, static inline void sk_psock_restore_proto(struct sock *sk, struct sk_psock *psock) { - sk->sk_prot->unhash = psock->saved_unhash; if (inet_csk_has_ulp(sk)) { + sk->sk_prot->unhash = psock->saved_unhash; tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space); } else { sk->sk_write_space = psock->saved_write_space; Thanks.
Cong Wang wrote: > On Fri, Mar 5, 2021 at 5:55 PM John Fastabend <john.fastabend@gmail.com> wrote: > > [...] > > > // tcp_bpf_prots->unhash == sock_map_unhash > > > sk_psock_restore_proto(); > > > // Now tcp_bpf_prots->unhash is inet_unhash > > > ... > > > sk_psock_update_proto(); > > > // sk->sk_proto is now tcp_bpf_prots again, > > > // so its ->unhash now is inet_unhash > > > // but it should be sock_map_unhash here > > > > Right, we can fix this on the TLS side. I'll push a fix shortly. > > Are you still working on this? If kTLS still needs it, then we can > have something like this: Testing a fix now I will flush it out tomorrow. The below is not really correct either it just moves the issue so it only impacts TLS. > > diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h > index 8edbbf5f2f93..5eb617df7f48 100644 > --- a/include/linux/skmsg.h > +++ b/include/linux/skmsg.h > @@ -349,8 +349,8 @@ static inline void sk_psock_update_proto(struct sock *sk, > static inline void sk_psock_restore_proto(struct sock *sk, > struct sk_psock *psock) > { > - sk->sk_prot->unhash = psock->saved_unhash; > if (inet_csk_has_ulp(sk)) { > + sk->sk_prot->unhash = psock->saved_unhash; > tcp_update_ulp(sk, psock->sk_proto, psock->saved_write_space); > } else { > sk->sk_write_space = psock->saved_write_space; > > > Thanks.
From: Cong Wang <cong.wang@bytedance.com> We have thousands of services connected to a daemon on every host via AF_UNIX dgram sockets, after they are moved into VM, we have to add a proxy to forward these communications from VM to host, because rewriting thousands of them is not practical. This proxy uses an AF_UNIX socket connected to services and a UDP socket to connect to the host. It is inefficient because data is copied between kernel space and user space twice, and we can not use splice() which only supports TCP. Therefore, we want to use sockmap to do the splicing without going to user-space at all (after the initial setup). Currently sockmap only fully supports TCP, UDP is partially supported as it is only allowed to add into sockmap. This patchset, as the second part of the original large patchset, extends sockmap with: 1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP support. On the high level, ->sendmsg_locked() and ->read_sock() are required for each protocol to support sockmap redirection, and in order to do sock proto update, a new ops ->update_proto() is introduced, which is also required to implement. A BPF ->recvmsg() is also needed to replace the original ->recvmsg() to retrieve skmsg. Please see each patch for more details. To see the big picture, the original patchset is available here: https://github.com/congwang/linux/tree/sockmap this patchset is also available: https://github.com/congwang/linux/tree/sockmap2 --- v2: separate from the original large patchset rebase to the latest bpf-next split UDP test case move inet_csk_has_ulp() check to tcp_bpf.c clean up udp_read_sock() Cong Wang (9): sock_map: introduce BPF_SK_SKB_VERDICT sock: introduce sk_prot->update_proto() udp: implement ->sendmsg_locked() udp: implement ->read_sock() for sockmap udp: add ->read_sock() and ->sendmsg_locked() to ipv6 skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data() udp: implement udp_bpf_recvmsg() for sockmap sock_map: update sock type checks for UDP selftests/bpf: add a test case for udp sockmap include/linux/skmsg.h | 25 ++-- include/net/ipv6.h | 1 + include/net/sock.h | 3 + include/net/tcp.h | 3 +- include/net/udp.h | 4 + include/uapi/linux/bpf.h | 1 + kernel/bpf/syscall.c | 1 + net/core/skmsg.c | 113 +++++++++++++- net/core/sock_map.c | 52 ++++--- net/ipv4/af_inet.c | 2 + net/ipv4/tcp_bpf.c | 129 +++------------- net/ipv4/tcp_ipv4.c | 3 + net/ipv4/udp.c | 68 ++++++++- net/ipv4/udp_bpf.c | 78 +++++++++- net/ipv6/af_inet6.c | 2 + net/ipv6/tcp_ipv6.c | 3 + net/ipv6/udp.c | 30 +++- net/tls/tls_sw.c | 4 +- tools/bpf/bpftool/common.c | 1 + tools/bpf/bpftool/prog.c | 1 + tools/include/uapi/linux/bpf.h | 1 + .../selftests/bpf/prog_tests/sockmap_listen.c | 140 ++++++++++++++++++ .../selftests/bpf/progs/test_sockmap_listen.c | 22 +++ 23 files changed, 517 insertions(+), 170 deletions(-)