Message ID: 20250220020914.895431-1-almasrymina@google.com
Series: Device memory TCP TX
On 02/20, Mina Almasry wrote:
> Add support for devmem TX in ncdevmem.
>
> This is a combination of the ncdevmem from the devmem TCP series RFCv1
> which included the TX path, and work by Stan to include the netlink API
> and refactored on top of his generic memory_provider support.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
>
> ---
>
> v4:
> - Add TX test to devmem.py (Paolo).
>
> v3:
> - Update ncdevmem docs to run validation with RX-only and RX-with-TX.
> - Fix build warnings (Stan).
> - Make the validation expect new lines in the pattern so we can have the
>   TX path behave like netcat (Stan).
> - Change ret to errno in error() calls (Stan).
> - Handle the case where client_ip is not provided (Stan).
> - Don't assume mid is <= 2000 (Stan).
>
> v2:
> - make errors a static variable so that we catch instances where there
>   are less than 20 errors across different buffers.
> - Fix the issue where the seed is reset to 0 instead of its starting
>   value 1.
> - Use 1000ULL instead of 1000 to guard against overflow (Willem).
> - Do not set POLLERR (Willem).
> - Update the test to use the new interface where iov_base is the
>   dmabuf_offset.
> - Update the test to send 2 iov instead of 1, so we get some test
>   coverage over sending multiple iovs at once.
> - Print the ifindex the test is using, useful for debugging issues where
>   maybe the test may fail because the ifindex of the socket is different
>   from the dmabuf binding.
>
> ---
>  .../selftests/drivers/net/hw/devmem.py  |  28 +-
>  .../selftests/drivers/net/hw/ncdevmem.c | 300 +++++++++++++++++-
>  2 files changed, 312 insertions(+), 16 deletions(-)
>
> diff --git a/tools/testing/selftests/drivers/net/hw/devmem.py b/tools/testing/selftests/drivers/net/hw/devmem.py
> index 1223f0f5c10c..3d4f7fc5e63f 100755
> --- a/tools/testing/selftests/drivers/net/hw/devmem.py
> +++ b/tools/testing/selftests/drivers/net/hw/devmem.py
> @@ -1,6 +1,7 @@
>  #!/usr/bin/env python3
>  # SPDX-License-Identifier: GPL-2.0
>
> +from os import path
>  from lib.py import ksft_run, ksft_exit
>  from lib.py import ksft_eq, KsftSkipEx
>  from lib.py import NetDrvEpEnv
> @@ -10,8 +11,7 @@ from lib.py import ksft_disruptive
>
>  def require_devmem(cfg):
>      if not hasattr(cfg, "_devmem_probed"):
> -        port = rand_port()
> -        probe_command = f"./ncdevmem -f {cfg.ifname}"
> +        probe_command = f"{cfg.bin_local} -f {cfg.ifname}"
>          cfg._devmem_supported = cmd(probe_command, fail=False, shell=True).ret == 0
>          cfg._devmem_probed = True
>
> @@ -25,18 +25,36 @@ def check_rx(cfg) -> None:
>      require_devmem(cfg)
>
>      port = rand_port()
> -    listen_cmd = f"./ncdevmem -l -f {cfg.ifname} -s {cfg.v6} -p {port}"
> +    listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.v6} -p {port}"

Commit de94e8697405 ("selftests: drv-net: store addresses in dict indexed
by ipver") just went in, so v6 needs to be addr_v['6'] and remote_v6 needs
to be remote_addr_v['6'].

>
>      with bkg(listen_cmd) as socat:
>          wait_port_listen(port)
> -        cmd(f"echo -e \"hello\\nworld\"| socat -u - TCP6:[{cfg.v6}]:{port}", host=cfg.remote, shell=True)
> +        cmd(f"echo -e \"hello\\nworld\"| socat -u - TCP6:{cfg.v6}:{port},bind={cfg.remote_v6}:{port}", host=cfg.remote, shell=True)

[..]

> +    ksft_eq(ncdevmem.stdout.strip(), "hello\nworld")

s/ncdevmem/socat/ (or rename socat in the with block above)

> +@ksft_disruptive
> +def check_tx(cfg) -> None:

[..]

> +    cfg.require_v6()

This is also now require_ipver("6") I think..

Gonna try to run the selftest and see if anything else pops up...
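Something like this is what I'd expect the rebased check_rx() to look like
(untested sketch; the addr_v/remote_addr_v/require_ipver names are taken from
commit de94e8697405 and the nits above, so double-check against the respin):

def check_rx(cfg) -> None:
    # helpers (bkg, cmd, rand_port, ...) come from lib.py as imported above
    cfg.require_ipver("6")
    require_devmem(cfg)

    port = rand_port()
    listen_cmd = f"{cfg.bin_local} -l -f {cfg.ifname} -s {cfg.addr_v['6']} -p {port}"

    # renamed from 'socat' so the ksft_eq() below reads right
    with bkg(listen_cmd) as ncdevmem:
        wait_port_listen(port)
        cmd(f"echo -e \"hello\\nworld\"| socat -u - "
            f"TCP6:[{cfg.addr_v['6']}]:{port},bind=[{cfg.remote_addr_v['6']}]:{port}",
            host=cfg.remote, shell=True)

    ksft_eq(ncdevmem.stdout.strip(), "hello\nworld")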
On 02/20, Mina Almasry wrote:
> Currently net_iovs support only pp ref counts, and do not support a
> page ref equivalent.
>
> This is fine for the RX path as net_iovs are used exclusively with the
> pp and only pp refcounting is needed there. The TX path however does not
> use pp ref counts, thus, support for get_page/put_page equivalent is
> needed for netmem.
>
> Support get_netmem/put_netmem. Check the type of the netmem before
> passing it to page or net_iov specific code to obtain a page ref
> equivalent.
>
> For dmabuf net_iovs, we obtain a ref on the underlying binding. This
> ensures the entire binding doesn't disappear until all the net_iovs have
> been put_netmem'ed. We do not need to track the refcount of individual
> dmabuf net_iovs as we don't allocate/free them from a pool similar to
> what the buddy allocator does for pages.
>
> This code is written to be extensible by other net_iov implementers.
> get_netmem/put_netmem will check the type of the netmem and route it to
> the correct helper:
>
> pages -> [get|put]_page()
> dmabuf net_iovs -> net_devmem_[get|put]_net_iov()
> new net_iovs -> new helpers
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

I remember you wanted to have a more fine-grained refcnt, but starting with
this is good enough for me.
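To make the routing table above concrete, here's roughly what I picture the
helpers looking like (a sketch of my reading; only the
net_devmem_[get|put]_net_iov() names are from the patch, the bodies here are
illustrative):

/* Route the ref op based on the netmem type: pages keep using the page
 * refcount, dmabuf net_iovs take a ref on the underlying binding.
 */
void get_netmem(netmem_ref netmem)
{
        if (netmem_is_net_iov(netmem)) {
                net_devmem_get_net_iov(netmem_to_net_iov(netmem));
                return;
        }
        get_page(netmem_to_page(netmem));
}

void put_netmem(netmem_ref netmem)
{
        if (netmem_is_net_iov(netmem)) {
                net_devmem_put_net_iov(netmem_to_net_iov(netmem));
                return;
        }
        put_page(netmem_to_page(netmem));
}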
On 02/20, Mina Almasry wrote:
> Augment dmabuf binding to be able to handle TX. Additional to all the RX
> binding, we also create tx_vec needed for the TX path.
>
> Provide API for sendmsg to be able to send dmabufs bound to this device:
>
> - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
> - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
>
> Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
> implementation, while disabling instances where MSG_ZEROCOPY falls back
> to copying.
>
> We additionally pipe the binding down to the new
> zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
> instead of the traditional page netmems.
>
> We also special case skb_frag_dma_map to return the dma-address of these
> dmabuf net_iovs instead of attempting to map pages.
>
> Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat
> of the implementation came from devmem TCP RFC v1[1], which included the
> TX path, but Stan did all the rebasing on top of netmem/net_iov.
>
> Cc: Stanislav Fomichev <sdf@fomichev.me>
> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
>
> ---
>
> v4:
> - Remove dmabuf_tx_cmsg definition and just use __u32 for the dma-buf id
>   (Willem).
> - Check that iov_iter_type() is ITER_IOVEC in
>   zerocopy_fill_skb_from_iter() (Pavel).
> - Fix binding->tx_vec not being freed on error paths (Paolo).
> - Make devmem patch mutually exclusive with msg->ubuf_info path (Pavel).
> - Check that MSG_ZEROCOPY and SOCK_ZEROCOPY are provided when
>   sockc.dmabuf_id is provided.
> - Don't mm_account_pinned_pages() on devmem TX (Pavel).
>
> v3:
> - Use kvmalloc_array instead of kcalloc (Stan).
> - Fix unreachable code warning (Simon).
>
> v2:
> - Remove dmabuf_offset from the dmabuf cmsg.
> - Update zerocopy_fill_skb_from_devmem to interpret the
>   iov_base/iter_iov_addr as the offset into the dmabuf to send from
>   (Stan).
> - Remove the confusing binding->tx_iter which is not needed if we
>   interpret the iov_base/iter_iov_addr as offset into the dmabuf (Stan).
> - Remove check for binding->sgt and binding->sgt->nents in dmabuf
>   binding.
> - Simplify the calculation of binding->tx_vec.
> - Check in net_devmem_get_binding that the binding we're returning
>   has ifindex matching the sending socket (Willem).
>
> ---
>  include/linux/skbuff.h                  |  17 +++-
>  include/net/sock.h                      |   1 +
>  net/core/datagram.c                     |  48 +++++++++++-
>  net/core/devmem.c                       | 100 ++++++++++++++++++++--
>  net/core/devmem.h                       |  41 +++++++++-
>  net/core/netdev-genl.c                  |  64 ++++++++++++++-
>  net/core/skbuff.c                       |  18 +++--
>  net/core/sock.c                         |   6 ++
>  net/ipv4/ip_output.c                    |   3 +-
>  net/ipv4/tcp.c                          |  46 ++++++++---
>  net/ipv6/ip6_output.c                   |   3 +-
>  net/vmw_vsock/virtio_transport_common.c |   5 +-
>  12 files changed, 309 insertions(+), 43 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index bb2b751d274a..b8783aa94c1e 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1707,13 +1707,16 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
>  extern const struct ubuf_info_ops msg_zerocopy_ubuf_ops;
>
>  struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
> -                                       struct ubuf_info *uarg);
> +                                       struct ubuf_info *uarg, bool devmem);
>
>  void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
>
> +struct net_devmem_dmabuf_binding;
> +
>  int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>                              struct sk_buff *skb, struct iov_iter *from,
> -                            size_t length);
> +                            size_t length,
> +                            struct net_devmem_dmabuf_binding *binding);
>
>  int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>                                  struct iov_iter *from, size_t length);
> @@ -1721,12 +1724,14 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>  static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
>                                            struct msghdr *msg, int len)
>  {
> -        return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len);
> +        return __zerocopy_sg_from_iter(msg, skb->sk, skb, &msg->msg_iter, len,
> +                                       NULL);
>  }
>
>  int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
>                               struct msghdr *msg, int len,
> -                             struct ubuf_info *uarg);
> +                             struct ubuf_info *uarg,
> +                             struct net_devmem_dmabuf_binding *binding);
>
>  /* Internal */
>  #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))
> @@ -3697,6 +3702,10 @@ static inline dma_addr_t __skb_frag_dma_map(struct device *dev,
>                                              size_t offset, size_t size,
>                                              enum dma_data_direction dir)
>  {
> +        if (skb_frag_is_net_iov(frag)) {
> +                return netmem_to_net_iov(frag->netmem)->dma_addr + offset +
> +                       frag->offset;
> +        }
>          return dma_map_page(dev, skb_frag_page(frag),
>                              skb_frag_off(frag) + offset, size, dir);
>  }
> diff --git a/include/net/sock.h b/include/net/sock.h
> index fac65ed30983..aac7e9d92ba5 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1823,6 +1823,7 @@ struct sockcm_cookie {
>          u32 tsflags;
>          u32 ts_opt_id;
>          u32 priority;
> +        u32 dmabuf_id;
>  };
>
>  static inline void sockcm_init(struct sockcm_cookie *sockc,
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index f0693707aece..3e60c83305cc 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -63,6 +63,8 @@
>  #include <net/busy_poll.h>
>  #include <crypto/hash.h>
>
> +#include "devmem.h"
> +
>  /*
>   *      Is a socket 'connection oriented' ?
>   */
> @@ -692,9 +694,49 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb,
>          return 0;
>  }
>
> +static int
> +zerocopy_fill_skb_from_devmem(struct sk_buff *skb, struct iov_iter *from,
> +                              int length,
> +                              struct net_devmem_dmabuf_binding *binding)
> +{
> +        int i = skb_shinfo(skb)->nr_frags;
> +        size_t virt_addr, size, off;
> +        struct net_iov *niov;
> +
> +        /* Devmem filling works by taking an IOVEC from the user where the
> +         * iov_addrs are interpreted as an offset in bytes into the dma-buf to
> +         * send from. We do not support other iter types.
> +         */
> +        if (iov_iter_type(from) != ITER_IOVEC)
> +                return -EINVAL;

It seems like the caller (skb_zerocopy_iter_stream) needs to special-case
EINVAL. Right now it only expects EFAULT or EMSGSIZE (and backtracks). The
rest of the errors are ignored and it reports the number of bytes copied
(which will be zero in your case, but still not what we want). Or maybe this
check needs to happen earlier? In tcp_sendmsg_locked?

> +
> +        while (length && iov_iter_count(from)) {
> +                if (i == MAX_SKB_FRAGS)
> +                        return -EMSGSIZE;
> +
> +                virt_addr = (size_t)iter_iov_addr(from);
> +                niov = net_devmem_get_niov_at(binding, virt_addr, &off, &size);
> +                if (!niov)
> +                        return -EFAULT;
> +
> +                size = min_t(size_t, size, length);
> +                size = min_t(size_t, size, iter_iov_len(from));
> +
> +                get_netmem(net_iov_to_netmem(niov));
> +                skb_add_rx_frag_netmem(skb, i, net_iov_to_netmem(niov), off,
> +                                       size, PAGE_SIZE);
> +                iov_iter_advance(from, size);
> +                length -= size;
> +                i++;
> +        }
> +
> +        return 0;
> +}
> +
>  int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>                              struct sk_buff *skb, struct iov_iter *from,
> -                            size_t length)
> +                            size_t length,
> +                            struct net_devmem_dmabuf_binding *binding)
>  {
>          unsigned long orig_size = skb->truesize;
>          unsigned long truesize;
> @@ -702,6 +744,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
>
>          if (msg && msg->msg_ubuf && msg->sg_from_iter)
>                  ret = msg->sg_from_iter(skb, from, length);
> +        else if (unlikely(binding))
> +                ret = zerocopy_fill_skb_from_devmem(skb, from, length, binding);
>          else
>                  ret = zerocopy_fill_skb_from_iter(skb, from, length);
>
> @@ -735,7 +779,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
>          if (skb_copy_datagram_from_iter(skb, 0, from, copy))
>                  return -EFAULT;
>
> -        return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U);
> +        return __zerocopy_sg_from_iter(NULL, NULL, skb, from, ~0U, NULL);
>  }
>  EXPORT_SYMBOL(zerocopy_sg_from_iter);
>
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index b1aafc66ebb7..2e576f80b1d8 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -17,6 +17,7 @@
>  #include <net/netdev_rx_queue.h>
>  #include <net/page_pool/helpers.h>
>  #include <net/page_pool/memory_provider.h>
> +#include <net/sock.h>
>  #include <trace/events/page_pool.h>
>
>  #include "devmem.h"
> @@ -73,8 +74,10 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding)
>          dma_buf_detach(binding->dmabuf, binding->attachment);
>          dma_buf_put(binding->dmabuf);
>          xa_destroy(&binding->bound_rxqs);
> +        kvfree(binding->tx_vec);
>          kfree(binding);
>  }
> +EXPORT_SYMBOL(__net_devmem_dmabuf_binding_free);
>
>  struct net_iov *
>  net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding)
> @@ -119,6 +122,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
>          unsigned long xa_idx;
>          unsigned int rxq_idx;
>
> +        xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> +
> +        /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the
> +         * erase.
> +         */
> +        synchronize_net();
> +
>          if (binding->list.next)
>                  list_del(&binding->list);
>
> @@ -133,8 +143,6 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding)
>                  WARN_ON(netdev_rx_queue_restart(binding->dev, rxq_idx));
>          }
>
> -        xa_erase(&net_devmem_dmabuf_bindings, binding->id);
> -
>          net_devmem_dmabuf_binding_put(binding);
>  }
>
> @@ -197,8 +205,9 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>  }
>
>  struct net_devmem_dmabuf_binding *
> -net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> -                       struct netlink_ext_ack *extack)
> +net_devmem_bind_dmabuf(struct net_device *dev,
> +                       enum dma_data_direction direction,
> +                       unsigned int dmabuf_fd, struct netlink_ext_ack *extack)
>  {
>          struct net_devmem_dmabuf_binding *binding;
>          static u32 id_alloc_next;
> @@ -241,7 +250,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
>          }
>
>          binding->sgt = dma_buf_map_attachment_unlocked(binding->attachment,
> -                                                       DMA_FROM_DEVICE);
> +                                                       direction);
>          if (IS_ERR(binding->sgt)) {
>                  err = PTR_ERR(binding->sgt);
>                  NL_SET_ERR_MSG(extack, "Failed to map dmabuf attachment");
> @@ -252,13 +261,23 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
>           * binding can be much more flexible than that. We may be able to
>           * allocate MTU sized chunks here. Leave that for future work...
>           */
> -        binding->chunk_pool =
> -                gen_pool_create(PAGE_SHIFT, dev_to_node(&dev->dev));
> +        binding->chunk_pool = gen_pool_create(PAGE_SHIFT,
> +                                              dev_to_node(&dev->dev));
>          if (!binding->chunk_pool) {
>                  err = -ENOMEM;
>                  goto err_unmap;
>          }
>
> +        if (direction == DMA_TO_DEVICE) {
> +                binding->tx_vec = kvmalloc_array(dmabuf->size / PAGE_SIZE,
> +                                                 sizeof(struct net_iov *),
> +                                                 GFP_KERNEL);
> +                if (!binding->tx_vec) {
> +                        err = -ENOMEM;
> +                        goto err_free_chunks;
> +                }
> +        }
> +
>          virtual = 0;
>          for_each_sgtable_dma_sg(binding->sgt, sg, sg_idx) {
>                  dma_addr_t dma_addr = sg_dma_address(sg);
> @@ -300,6 +319,8 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
>                          niov->owner = &owner->area;
>                          page_pool_set_dma_addr_netmem(net_iov_to_netmem(niov),
>                                                        net_devmem_get_dma_addr(niov));
> +                        if (direction == DMA_TO_DEVICE)
> +                                binding->tx_vec[owner->area.base_virtual / PAGE_SIZE + i] = niov;
>                  }
>
>                  virtual += len;
> @@ -311,6 +332,9 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
>          gen_pool_for_each_chunk(binding->chunk_pool,
>                                  net_devmem_dmabuf_free_chunk_owner, NULL);
>          gen_pool_destroy(binding->chunk_pool);
> +
> +        if (direction == DMA_TO_DEVICE)
> +                kvfree(binding->tx_vec);

nit: we might unconditionally kvfree() here? worst case we do kvfree(NULL)
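As an aside for anyone following along: given that tx_vec is filled per
PAGE_SIZE chunk in the loop above, the net_devmem_get_niov_at() helper used
by zerocopy_fill_skb_from_devmem() plausibly reduces to something like this
(a sketch of my reading, not the patch itself):

/* Translate a byte offset into the bound dmabuf into the net_iov covering
 * it, plus the offset into that chunk and the bytes left within it.
 */
static struct net_iov *
net_devmem_get_niov_at(struct net_devmem_dmabuf_binding *binding,
                       size_t virt_addr, size_t *off, size_t *size)
{
        if (virt_addr >= binding->dmabuf->size)
                return NULL;

        *off = virt_addr % PAGE_SIZE;
        *size = PAGE_SIZE - *off;

        return binding->tx_vec[virt_addr / PAGE_SIZE];
}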
On 02/20, Mina Almasry wrote:
> Add documentation outlining the usage and details of the devmem TCP TX
> API.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>

With a few nits below:

Acked-by: Stanislav Fomichev <sdf@fomichev.me>

>
> ---
>
> v4:
> - Mention SO_BINDTODEVICE is recommended (me/Pavel).
>
> v2:
> - Update documentation for iov_base is the dmabuf offset (Stan)
>
> ---
>  Documentation/networking/devmem.rst | 150 +++++++++++++++++++++++++++-
>  1 file changed, 146 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst
> index d95363645331..10928a5f912f 100644
> --- a/Documentation/networking/devmem.rst
> +++ b/Documentation/networking/devmem.rst
> @@ -62,15 +62,15 @@ More Info
>      https://lore.kernel.org/netdev/20240831004313.3713467-1-almasrymina@google.com/
>
>
> -Interface
> -=========
> +RX Interface
> +============
>
>
>  Example
>  -------
>
> -tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up
> -the RX path of this API.
> +./tools/testing/selftests/drivers/net/hw/ncdevmem:do_server shows an example of
> +setting up the RX path of this API.
>
>
>  NIC Setup
> @@ -235,6 +235,148 @@ can be less than the tokens provided by the user in case of:
>  (a) an internal kernel leak bug.
>  (b) the user passed more than 1024 frags.
>
> +TX Interface
> +============
> +
> +
> +Example
> +-------
> +
> +./tools/testing/selftests/drivers/net/hw/ncdevmem:do_client shows an example of
> +setting up the TX path of this API.
> +
> +
> +NIC Setup
> +---------
> +
> +The user must bind a TX dmabuf to a given NIC using the netlink API::
> +
> +        struct netdev_bind_tx_req *req = NULL;
> +        struct netdev_bind_tx_rsp *rsp = NULL;
> +        struct ynl_error yerr;
> +
> +        *ys = ynl_sock_create(&ynl_netdev_family, &yerr);
> +
> +        req = netdev_bind_tx_req_alloc();
> +        netdev_bind_tx_req_set_ifindex(req, ifindex);
> +        netdev_bind_tx_req_set_fd(req, dmabuf_fd);
> +
> +        rsp = netdev_bind_tx(*ys, req);
> +
> +        tx_dmabuf_id = rsp->id;
> +
> +
> +The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf
> +that has been bound.
> +
> +The user can unbind the dmabuf from the netdevice by closing the netlink socket
> +that established the binding. We do this so that the binding is automatically
> +unbound even if the userspace process crashes.
> +
> +Note that any reasonably well-behaved dmabuf from any exporter should work with
> +devmem TCP, even if the dmabuf is not actually backed by devmem. An example of
> +this is udmabuf, which wraps user memory (non-devmem) in a dmabuf.
> +
> +Socket Setup
> +------------
> +
> +The user application must use MSG_ZEROCOPY flag when sending devmem TCP. Devmem
> +cannot be copied by the kernel, so the semantics of the devmem TX are similar
> +to the semantics of MSG_ZEROCOPY.
> +
> +        setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &opt, sizeof(opt));
> +
> +It is also recommended that the user binds the TX socket to the same interface
> +the dma-buf has been bound to via SO_BINDTODEVICE.
> +
> +        setsockopt(socket_fd, SOL_SOCKET, SO_BINDTODEVICE, ifname, strlen(ifname) + 1);
> +
> +
> +Sending data
> +--------------

^^ extra junk

> +
> +Devmem data is sent using the SCM_DEVMEM_DMABUF cmsg.
> +
> +The user should create a msghdr where,
> +
> +iov_base is set to the offset into the dmabuf to start sending from.
> +iov_len is set to the number of bytes to be sent from the dmabuf.

nit: maybe bullet point the above?

The user should create a msghdr with the following set of msg_iov:

* iov_base is set to the offset into the dmabuf to start sending from
* iov_len is set to the number of bytes to be sent from the dmabuf
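To tie the documented pieces together, a sketch of a complete devmem send
(assuming socket_fd and tx_dmabuf_id from the setup snippets quoted above;
offset and length are illustrative):

char ctrl[CMSG_SPACE(sizeof(__u32))] = {};
struct iovec iov = {
        .iov_base = (void *)0,  /* offset into the dmabuf to send from */
        .iov_len = 4096,        /* bytes to send from the dmabuf */
};
struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl,
        .msg_controllen = sizeof(ctrl),
};
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_DEVMEM_DMABUF;
cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
*((__u32 *)CMSG_DATA(cmsg)) = tx_dmabuf_id;

sendmsg(socket_fd, &msg, MSG_ZEROCOPY);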