[net-next,v2,0/6] net: support QUIC crypto

Message ID	20220817200940.1656747-1-adel.abushaev@gmail.com
Headers	show Return-Path: <linux-kselftest-owner@kernel.org> From: Adel Abouchaev <adel.abushaev@gmail.com> To: kuba@kernel.org Cc: davem@davemloft.net, edumazet@google.com, pabeni@redhat.com, corbet@lwn.net, dsahern@kernel.org, shuah@kernel.org, imagedong@tencent.com, netdev@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org Subject: [net-next v2 0/6] net: support QUIC crypto Date: Wed, 17 Aug 2022 13:09:34 -0700 Message-Id: <20220817200940.1656747-1-adel.abushaev@gmail.com> In-Reply-To: <Adel Abouchaev <adel.abushaev@gmail.com> References: <Adel Abouchaev <adel.abushaev@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	net: support QUIC crypto \| expand [net-next,v2,0/6] net: support QUIC crypto [net-next,v2,2/6] Define QUIC specific constants, control and data plane structures [net-next,v2,4/6] Implement QUIC offload functions [net-next,v2,6/6] Add self tests for ULP operations, flow setup and crypto tests

Adel Abouchaev Aug. 17, 2022, 8:09 p.m. UTC

QUIC requires end to end encryption of the data. The application usually
prepares the data in clear text, encrypts and calls send() which implies
multiple copies of the data before the packets hit the networking stack.
Similar to kTLS, QUIC kernel offload of cryptography reduces the memory
pressure by reducing the number of copies.

The scope of kernel support is limited to the symmetric cryptography,
leaving the handshake to the user space library. For QUIC in particular,
the application packets that require symmetric cryptography are the 1RTT
packets with short headers. Kernel will encrypt the application packets
on transmission and decrypt on receive. This series implements Tx only,
because in QUIC server applications Tx outweighs Rx by orders of
magnitude.

Supporting the combination of QUIC and GSO requires the application to
correctly place the data and the kernel to correctly slice it. The
encryption process appends an arbitrary number of bytes (tag) to the end
of the message to authenticate it. The GSO value should include this
overhead, the offload would then subtract the tag size to parse the
input on Tx before chunking and encrypting it.

With the kernel cryptography, the buffer copy operation is conjoined
with the encryption operation. The memory bandwidth is reduced by 5-8%.
When devices supporting QUIC encryption in hardware come to the market,
we will be able to free further 7% of CPU utilization which is used
today for crypto operations.

Adel Abouchaev (6):
  Documentation on QUIC kernel Tx crypto.
  Define QUIC specific constants, control and data plane structures
  Add UDP ULP operations, initialization and handling prototype
    functions.
  Implement QUIC offload functions
  Add flow counters and Tx processing error counter
  Add self tests for ULP operations, flow setup and crypto tests

 Documentation/networking/index.rst     |    1 +
 Documentation/networking/quic.rst      |  185 ++++
 include/net/inet_sock.h                |    2 +
 include/net/netns/mib.h                |    3 +
 include/net/quic.h                     |   63 ++
 include/net/snmp.h                     |    6 +
 include/net/udp.h                      |   33 +
 include/uapi/linux/quic.h              |   60 +
 include/uapi/linux/snmp.h              |    9 +
 include/uapi/linux/udp.h               |    4 +
 net/Kconfig                            |    1 +
 net/Makefile                           |    1 +
 net/ipv4/Makefile                      |    3 +-
 net/ipv4/udp.c                         |   15 +
 net/ipv4/udp_ulp.c                     |  192 ++++
 net/quic/Kconfig                       |   16 +
 net/quic/Makefile                      |    8 +
 net/quic/quic_main.c                   | 1417 ++++++++++++++++++++++++
 net/quic/quic_proc.c                   |   45 +
 tools/testing/selftests/net/.gitignore |    4 +-
 tools/testing/selftests/net/Makefile   |    3 +-
 tools/testing/selftests/net/quic.c     | 1153 +++++++++++++++++++
 tools/testing/selftests/net/quic.sh    |   46 +
 23 files changed, 3267 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/networking/quic.rst
 create mode 100644 include/net/quic.h
 create mode 100644 include/uapi/linux/quic.h
 create mode 100644 net/ipv4/udp_ulp.c
 create mode 100644 net/quic/Kconfig
 create mode 100644 net/quic/Makefile
 create mode 100644 net/quic/quic_main.c
 create mode 100644 net/quic/quic_proc.c
 create mode 100644 tools/testing/selftests/net/quic.c
 create mode 100755 tools/testing/selftests/net/quic.sh


base-commit: fd78d07c7c35de260eb89f1be4a1e7487b8092ad

Bagas Sanjaya Aug. 18, 2022, 2:53 a.m. UTC | #1

On Wed, Aug 17, 2022 at 01:09:35PM -0700, Adel Abouchaev wrote:
> Add documentation for kernel QUIC code.
> 

The documentation LGTM.

Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>

Xin Long Aug. 24, 2022, 6:29 p.m. UTC | #2

On Wed, Aug 17, 2022 at 4:11 PM Adel Abouchaev <adel.abushaev@gmail.com> wrote:
>
> QUIC requires end to end encryption of the data. The application usually
> prepares the data in clear text, encrypts and calls send() which implies
> multiple copies of the data before the packets hit the networking stack.
> Similar to kTLS, QUIC kernel offload of cryptography reduces the memory
> pressure by reducing the number of copies.
>
> The scope of kernel support is limited to the symmetric cryptography,
> leaving the handshake to the user space library. For QUIC in particular,
> the application packets that require symmetric cryptography are the 1RTT
> packets with short headers. Kernel will encrypt the application packets
> on transmission and decrypt on receive. This series implements Tx only,
> because in QUIC server applications Tx outweighs Rx by orders of
> magnitude.
>
> Supporting the combination of QUIC and GSO requires the application to
> correctly place the data and the kernel to correctly slice it. The
> encryption process appends an arbitrary number of bytes (tag) to the end
> of the message to authenticate it. The GSO value should include this
> overhead, the offload would then subtract the tag size to parse the
> input on Tx before chunking and encrypting it.
>
> With the kernel cryptography, the buffer copy operation is conjoined
> with the encryption operation. The memory bandwidth is reduced by 5-8%.
> When devices supporting QUIC encryption in hardware come to the market,
> we will be able to free further 7% of CPU utilization which is used
> today for crypto operations.
>
> Adel Abouchaev (6):
>   Documentation on QUIC kernel Tx crypto.
>   Define QUIC specific constants, control and data plane structures
>   Add UDP ULP operations, initialization and handling prototype
>     functions.
>   Implement QUIC offload functions
>   Add flow counters and Tx processing error counter
>   Add self tests for ULP operations, flow setup and crypto tests
>
>  Documentation/networking/index.rst     |    1 +
>  Documentation/networking/quic.rst      |  185 ++++
>  include/net/inet_sock.h                |    2 +
>  include/net/netns/mib.h                |    3 +
>  include/net/quic.h                     |   63 ++
>  include/net/snmp.h                     |    6 +
>  include/net/udp.h                      |   33 +
>  include/uapi/linux/quic.h              |   60 +
>  include/uapi/linux/snmp.h              |    9 +
>  include/uapi/linux/udp.h               |    4 +
>  net/Kconfig                            |    1 +
>  net/Makefile                           |    1 +
>  net/ipv4/Makefile                      |    3 +-
>  net/ipv4/udp.c                         |   15 +
>  net/ipv4/udp_ulp.c                     |  192 ++++
>  net/quic/Kconfig                       |   16 +
>  net/quic/Makefile                      |    8 +
>  net/quic/quic_main.c                   | 1417 ++++++++++++++++++++++++
>  net/quic/quic_proc.c                   |   45 +
>  tools/testing/selftests/net/.gitignore |    4 +-
>  tools/testing/selftests/net/Makefile   |    3 +-
>  tools/testing/selftests/net/quic.c     | 1153 +++++++++++++++++++
>  tools/testing/selftests/net/quic.sh    |   46 +
>  23 files changed, 3267 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/networking/quic.rst
>  create mode 100644 include/net/quic.h
>  create mode 100644 include/uapi/linux/quic.h
>  create mode 100644 net/ipv4/udp_ulp.c
>  create mode 100644 net/quic/Kconfig
>  create mode 100644 net/quic/Makefile
>  create mode 100644 net/quic/quic_main.c
>  create mode 100644 net/quic/quic_proc.c
>  create mode 100644 tools/testing/selftests/net/quic.c
>  create mode 100755 tools/testing/selftests/net/quic.sh
>
>
> base-commit: fd78d07c7c35de260eb89f1be4a1e7487b8092ad
> --
> 2.30.2
>
Hi, Adel,

I don't see how the key update(rfc9001#section-6) is handled on the TX
path, which is not using TLS Key update, and "Key Phase" indicates
which key will be used after rekeying. Also, I think it is almost
impossible to handle the peer rekeying on the RX path either based on
your current model in the future.

The patch seems to get the crypto_ctx by doing a connection hash table
lookup in the sendmsg(), which is not good from the performance side.
One QUIC connection can go over multiple UDP sockets, but I don't
think one socket can be used by multiple QUIC connections. So why not
save the ctx in the socket instead?

The patch is to reduce the copying operations between user space and
the kernel. I might miss something in your user space code, but the
msg to send is *already packed* into the Stream Frame in user space,
what's the difference if you encrypt it in userspace and then
sendmsg(udp_sk) with zero-copy to the kernel.

Didn't really understand the "GSO" you mentioned, as I don't see any
code about kernel GSO, I guess it's just "Fragment size", right?
BTW, it‘s not common to use "//" for the kernel annotation.

I'm not sure if it's worth adding a ULP layer over UDP for this QUIC
TX only. Honestly, I'm more supporting doing a full QUIC stack in the
kernel independently with socket APIs to use it:
https://github.com/lxin/tls_hs.

Thanks.

Adel Abouchaev Aug. 24, 2022, 11:09 p.m. UTC | #3

On 8/24/22 11:29 AM, Xin Long wrote:
> On Wed, Aug 17, 2022 at 4:11 PM Adel Abouchaev <adel.abushaev@gmail.com> wrote:
>> QUIC requires end to end encryption of the data. The application usually
>> prepares the data in clear text, encrypts and calls send() which implies
>> multiple copies of the data before the packets hit the networking stack.
>> Similar to kTLS, QUIC kernel offload of cryptography reduces the memory
>> pressure by reducing the number of copies.
>>
>> The scope of kernel support is limited to the symmetric cryptography,
>> leaving the handshake to the user space library. For QUIC in particular,
>> the application packets that require symmetric cryptography are the 1RTT
>> packets with short headers. Kernel will encrypt the application packets
>> on transmission and decrypt on receive. This series implements Tx only,
>> because in QUIC server applications Tx outweighs Rx by orders of
>> magnitude.
>>
>> Supporting the combination of QUIC and GSO requires the application to
>> correctly place the data and the kernel to correctly slice it. The
>> encryption process appends an arbitrary number of bytes (tag) to the end
>> of the message to authenticate it. The GSO value should include this
>> overhead, the offload would then subtract the tag size to parse the
>> input on Tx before chunking and encrypting it.
>>
>> With the kernel cryptography, the buffer copy operation is conjoined
>> with the encryption operation. The memory bandwidth is reduced by 5-8%.
>> When devices supporting QUIC encryption in hardware come to the market,
>> we will be able to free further 7% of CPU utilization which is used
>> today for crypto operations.
>>
>> Adel Abouchaev (6):
>>    Documentation on QUIC kernel Tx crypto.
>>    Define QUIC specific constants, control and data plane structures
>>    Add UDP ULP operations, initialization and handling prototype
>>      functions.
>>    Implement QUIC offload functions
>>    Add flow counters and Tx processing error counter
>>    Add self tests for ULP operations, flow setup and crypto tests
>>
>>   Documentation/networking/index.rst     |    1 +
>>   Documentation/networking/quic.rst      |  185 ++++
>>   include/net/inet_sock.h                |    2 +
>>   include/net/netns/mib.h                |    3 +
>>   include/net/quic.h                     |   63 ++
>>   include/net/snmp.h                     |    6 +
>>   include/net/udp.h                      |   33 +
>>   include/uapi/linux/quic.h              |   60 +
>>   include/uapi/linux/snmp.h              |    9 +
>>   include/uapi/linux/udp.h               |    4 +
>>   net/Kconfig                            |    1 +
>>   net/Makefile                           |    1 +
>>   net/ipv4/Makefile                      |    3 +-
>>   net/ipv4/udp.c                         |   15 +
>>   net/ipv4/udp_ulp.c                     |  192 ++++
>>   net/quic/Kconfig                       |   16 +
>>   net/quic/Makefile                      |    8 +
>>   net/quic/quic_main.c                   | 1417 ++++++++++++++++++++++++
>>   net/quic/quic_proc.c                   |   45 +
>>   tools/testing/selftests/net/.gitignore |    4 +-
>>   tools/testing/selftests/net/Makefile   |    3 +-
>>   tools/testing/selftests/net/quic.c     | 1153 +++++++++++++++++++
>>   tools/testing/selftests/net/quic.sh    |   46 +
>>   23 files changed, 3267 insertions(+), 3 deletions(-)
>>   create mode 100644 Documentation/networking/quic.rst
>>   create mode 100644 include/net/quic.h
>>   create mode 100644 include/uapi/linux/quic.h
>>   create mode 100644 net/ipv4/udp_ulp.c
>>   create mode 100644 net/quic/Kconfig
>>   create mode 100644 net/quic/Makefile
>>   create mode 100644 net/quic/quic_main.c
>>   create mode 100644 net/quic/quic_proc.c
>>   create mode 100644 tools/testing/selftests/net/quic.c
>>   create mode 100755 tools/testing/selftests/net/quic.sh
>>
>>
>> base-commit: fd78d07c7c35de260eb89f1be4a1e7487b8092ad
>> --
>> 2.30.2
>>
> Hi, Adel,
>
> I don't see how the key update(rfc9001#section-6) is handled on the TX
> path, which is not using TLS Key update, and "Key Phase" indicates
> which key will be used after rekeying. Also, I think it is almost
> impossible to handle the peer rekeying on the RX path either based on
> your current model in the future.

The update is not present in these patches, but it is an important part 
of the QUIC functionality. As this patch is only storing a single key, 
you are correct that this approach does not handle the key rotation. To 
implement re-keying on Tx and on Rx a rolling secret will need to be 
stored in kernel. In that case, the subsequent 1RTT (Application space) 
keys will be refreshed by the kernel. After all, when the hardware is 
mature enough to support QUIC encryption and decryption - the secret 
will need to be kept in the hardware to react on time on Rx, especially. 
Tx path could solicit the re-key at any point or by the exhaustion of 
the counter of GCM (packet number in this case). The RFC expects the 
implementation to retain 2 keys, at least, while keeping 3 (old, current 
and next) is not prohibited either. Keeping more is not necessary.

>
> The patch seems to get the crypto_ctx by doing a connection hash table
> lookup in the sendmsg(), which is not good from the performance side.
> One QUIC connection can go over multiple UDP sockets, but I don't
> think one socket can be used by multiple QUIC connections. So why not
> save the ctx in the socket instead?
A single socket could have multiple connections originated from it, 
having different destinations, if the socket is not connected. An 
optimization could be made for connected sockets to cache the context 
and save time on a lookup. The measurement of kernel operations timing 
did not reveal a significant amount of time spent in this lookup due to 
a relatively small number of connections per socket in general. A shared 
table across multiple sockets might experience a different performance 
grading.
>
> The patch is to reduce the copying operations between user space and
> the kernel. I might miss something in your user space code, but the
> msg to send is *already packed* into the Stream Frame in user space,
> what's the difference if you encrypt it in userspace and then
> sendmsg(udp_sk) with zero-copy to the kernel.
It is possible to do it this way. Zero-copy works best with packet sizes 
starting at 32K and larger.  Anything less than that would consume the 
improvements of zero-copy by zero-copy pre/post operations and needs to 
align memory. The other possible obstacle would be that eventual support 
of QUIC encryption and decryption in hardware would integrate well with 
this current approach.
>
> Didn't really understand the "GSO" you mentioned, as I don't see any
> code about kernel GSO, I guess it's just "Fragment size", right?
> BTW, it‘s not common to use "//" for the kernel annotation.
Once the payload arrives into the kernel, the GSO on the interface would 
instruct L3/L4 stack on fragmentation. In this case, the plaintext QUIC 
packets should be aligned on the GSO marks less the tag size that would 
be added by encryption. For GSO size 1000, the QUIC packets in the batch 
for transmission should all be 984 bytes long, except maybe the last 
one. Once the tag is attached, the new size of 1000 will correctly split 
the QUIC packets further down the stack for transmission in individual 
IP/UDP packets. The code is also saving processing time by sending all 
packets at once to UDP in a single call, when GSO is enabled.
>
> I'm not sure if it's worth adding a ULP layer over UDP for this QUIC
> TX only. Honestly, I'm more supporting doing a full QUIC stack in the
> kernel independently with socket APIs to use it:
> https://github.com/lxin/tls_hs.
>
> Thanks.

Willem de Bruijn Sept. 25, 2022, 6:04 p.m. UTC | #4

> >
> > The patch seems to get the crypto_ctx by doing a connection hash table
> > lookup in the sendmsg(), which is not good from the performance side.
> > One QUIC connection can go over multiple UDP sockets, but I don't
> > think one socket can be used by multiple QUIC connections. So why not
> > save the ctx in the socket instead?
> A single socket could have multiple connections originated from it,
> having different destinations, if the socket is not connected. An
> optimization could be made for connected sockets to cache the context
> and save time on a lookup. The measurement of kernel operations timing
> did not reveal a significant amount of time spent in this lookup due to
> a relatively small number of connections per socket in general. A shared
> table across multiple sockets might experience a different performance
> grading.

I'm late to this patch series, sorry. High quality implementation. I
have a few design questions similar to Xin.

If multiplexing, instead of looking up a connection by { address, port
variable length connection ID }, perhaps return a connection table
index on setsockopt and use that in sendmsg.

> >
> > The patch is to reduce the copying operations between user space and
> > the kernel. I might miss something in your user space code, but the
> > msg to send is *already packed* into the Stream Frame in user space,
> > what's the difference if you encrypt it in userspace and then
> > sendmsg(udp_sk) with zero-copy to the kernel.
> It is possible to do it this way. Zero-copy works best with packet sizes
> starting at 32K and larger.  Anything less than that would consume the
> improvements of zero-copy by zero-copy pre/post operations and needs to
> align memory.

Part of the cost of MSG_ZEROCOPY is in mapping and unmapping user
pages. This series re-implements that with its own get_user_pages.
That is duplicative non-trivial code. And it will incur the same cost.
What this implementation saves is the (indeed non-trivial)
asynchronous completion notification over the error queue.

The cover letter gives some performance numbers against a userspace
implementation that has to copy from user to kernel. It might be more
even to compare against an implementation using MSG_ZEROCOPY and
UDP_SEGMENT. A userspace crypto implementation may have other benefits
compared to a kernel implementation, such as not having to convert to
crypto API scatter-gather arrays and back to network structures.

A few related points

- The implementation support multiplexed connections, but only one
crypto sendmsg can be outstanding at any time:

  + /**
  + * To synchronize concurrent sendmsg() requests through the same socket
  + * and protect preallocated per-context memory.
  + **/
  + struct mutex sendmsg_mux;

That is quite limiting for production workloads.

- Crypto operations are also executed synchronously, using
crypto_wait_req after each operationn. This limits throughput by using
at most one core per UDP socket. And adds sendmsg latency (which may
or may not be important to the application). Wireguard shows an
example of how to parallelize software crypto across cores.

- The implementation avoids dynamic allocation of cipher text pages by
using a single ctx->cipher_page. This is protected by sendmsg_mux (see
above). Is that safe when packets leave the protocol stack and are
then held in a qdisc or when being processed by the NIC?
quic_sendmsg_locked will return, but the cipher page is not free to
reuse yet.

- The real benefit of kernel QUIC will come from HW offload. Would it
be better to avoid the complexity of an in-kernel software
implementation and only focus on HW offload? Basically, pass the
plaintext QUIC packets over a standard UDP socket and alongside in a
cmsg pass either an index into a HW security association database or
the immediate { key, iv } connection_info (for stateless sockets), to
be encoded into the descriptor by the device driver.

- With such a simpler path, could we avoid introducing ULP and just
have udp [gs]etsockopt CRYPTO_STATE. Where QUIC is the only defined
state type yet.

- Small aside: as the series introduces new APIs with non-trivial
parsing in the kernel, it's good to run a fuzzer like syzkaller on it
(if not having done so yet).

> The other possible obstacle would be that eventual support
> of QUIC encryption and decryption in hardware would integrate well with
> this current approach.
> >
> > Didn't really understand the "GSO" you mentioned, as I don't see any
> > code about kernel GSO, I guess it's just "Fragment size", right?
> > BTW, it‘s not common to use "//" for the kernel annotation.

minor point: fragment has meaning in IPv4. For GSO, prefer gso_size.

> Once the payload arrives into the kernel, the GSO on the interface would
> instruct L3/L4 stack on fragmentation. In this case, the plaintext QUIC
> packets should be aligned on the GSO marks less the tag size that would
> be added by encryption. For GSO size 1000, the QUIC packets in the batch
> for transmission should all be 984 bytes long, except maybe the last
> one. Once the tag is attached, the new size of 1000 will correctly split
> the QUIC packets further down the stack for transmission in individual
> IP/UDP packets. The code is also saving processing time by sending all
> packets at once to UDP in a single call, when GSO is enabled.
> >
> > I'm not sure if it's worth adding a ULP layer over UDP for this QUIC
> > TX only. Honestly, I'm more supporting doing a full QUIC stack in the
> > kernel independently with socket APIs to use it:
> > https://github.com/lxin/tls_hs.
> >
> > Thanks.

Adel Abouchaev Sept. 27, 2022, 4:44 p.m. UTC | #5

On 9/25/22 11:04 AM, Willem de Bruijn wrote:
>>> The patch seems to get the crypto_ctx by doing a connection hash table
>>> lookup in the sendmsg(), which is not good from the performance side.
>>> One QUIC connection can go over multiple UDP sockets, but I don't
>>> think one socket can be used by multiple QUIC connections. So why not
>>> save the ctx in the socket instead?
>> A single socket could have multiple connections originated from it,
>> having different destinations, if the socket is not connected. An
>> optimization could be made for connected sockets to cache the context
>> and save time on a lookup. The measurement of kernel operations timing
>> did not reveal a significant amount of time spent in this lookup due to
>> a relatively small number of connections per socket in general. A shared
>> table across multiple sockets might experience a different performance
>> grading.
> I'm late to this patch series, sorry. High quality implementation. I
> have a few design questions similar to Xin.
>
> If multiplexing, instead of looking up a connection by { address, port
> variable length connection ID }, perhaps return a connection table
> index on setsockopt and use that in sendmsg.


It was deliberate to not to return anything other than 0 from 
setsockopt() as defined in the spec for the function. Despite that it 
says "shall", the doc says that 0 is the only value for successful 
operation. This was the reason not to use setsockopt() for any 
bidirectional transfers of data and or status. A more sophisticated 
approach with netlink sockets would be more suitable for it. The second 
reason is the API asymmetry for Tx and Rx which will be introduced - the 
Rx will still need to match on the address, port and cid. The third 
reason is that in current implementations there are no more than a few 
connections per socket, which does not abuse the rhashtable that does a 
lookup, although it takes time to hash the key into a hash for a seek. 
The performance measurement ran against the runtime and did not flag 
this path as underperforming either, there were other parts that 
substantially add to the runtime, not the key lookup though.


>>> The patch is to reduce the copying operations between user space and
>>> the kernel. I might miss something in your user space code, but the
>>> msg to send is *already packed* into the Stream Frame in user space,
>>> what's the difference if you encrypt it in userspace and then
>>> sendmsg(udp_sk) with zero-copy to the kernel.
>> It is possible to do it this way. Zero-copy works best with packet sizes
>> starting at 32K and larger.  Anything less than that would consume the
>> improvements of zero-copy by zero-copy pre/post operations and needs to
>> align memory.
> Part of the cost of MSG_ZEROCOPY is in mapping and unmapping user
> pages. This series re-implements that with its own get_user_pages.
> That is duplicative non-trivial code. And it will incur the same cost.
> What this implementation saves is the (indeed non-trivial)
> asynchronous completion notification over the error queue.
>
> The cover letter gives some performance numbers against a userspace
> implementation that has to copy from user to kernel. It might be more
> even to compare against an implementation using MSG_ZEROCOPY and
> UDP_SEGMENT. A userspace crypto implementation may have other benefits
> compared to a kernel implementation, such as not having to convert to
> crypto API scatter-gather arrays and back to network structures.
>
> A few related points
>
> - The implementation support multiplexed connections, but only one
> crypto sendmsg can be outstanding at any time:
>
>    + /**
>    + * To synchronize concurrent sendmsg() requests through the same socket
>    + * and protect preallocated per-context memory.
>    + **/
>    + struct mutex sendmsg_mux;
>
> That is quite limiting for production workloads.

The use case that we have with MVFST library currently runs a single 
worker for a connection and has a single socket attached to it. QUIC 
allows simultaneous use of multiple connection IDs to swap them in 
runtime, and implementation would request only a handful of these. The 
MVFST batches writes into a block of about 8Kb and then uses GSO to send 
them all at once.

> - Crypto operations are also executed synchronously, using
> crypto_wait_req after each operationn. This limits throughput by using
> at most one core per UDP socket. And adds sendmsg latency (which may
> or may not be important to the application). Wireguard shows an
> example of how to parallelize software crypto across cores.
>
> - The implementation avoids dynamic allocation of cipher text pages by
> using a single ctx->cipher_page. This is protected by sendmsg_mux (see
> above). Is that safe when packets leave the protocol stack and are
> then held in a qdisc or when being processed by the NIC?
> quic_sendmsg_locked will return, but the cipher page is not free to
> reuse yet.
There is currently no use case that we have in hands that requires 
parallel transmission of data for the same connection. Multiple 
connections would have no issue running in parallel as each of them will 
have it's own preallocated cipher_page in the context.

There is a fragmentation further down the stack with 
ip_generic_getfrag() that eventually does copy_from_iter() and makea a 
copy of the data. This is executed as part of __ip_append_data() called 
from udp_sendmsg() in ipv4/udp.c. The assumption was that this is 
executed synchronously and the queues and NIC will see a mapping of a 
different memory area than the ciphertext in the pre-allocated page.

>
> - The real benefit of kernel QUIC will come from HW offload. Would it
> be better to avoid the complexity of an in-kernel software
> implementation and only focus on HW offload? Basically, pass the
> plaintext QUIC packets over a standard UDP socket and alongside in a
> cmsg pass either an index into a HW security association database or
> the immediate { key, iv } connection_info (for stateless sockets), to
> be encoded into the descriptor by the device driver.
Hardware usually targets a single ciphersuite such as AES-GCM-128/256, 
while QUIC also supports Chacha20-Poly1305 and AES-CCM. The generalized 
support for offload prompted implementation of these ciphers in kernel 
code. The kernel code could also engage if the future hardware has 
capacity caps preventing it from handling all requests in the hardware.
> - With such a simpler path, could we avoid introducing ULP and just
> have udp [gs]etsockopt CRYPTO_STATE. Where QUIC is the only defined
> state type yet.
>
> - Small aside: as the series introduces new APIs with non-trivial
> parsing in the kernel, it's good to run a fuzzer like syzkaller on it
> (if not having done so yet).
Agreed.
>> The other possible obstacle would be that eventual support
>> of QUIC encryption and decryption in hardware would integrate well with
>> this current approach.
>>> Didn't really understand the "GSO" you mentioned, as I don't see any
>>> code about kernel GSO, I guess it's just "Fragment size", right?
>>> BTW, it‘s not common to use "//" for the kernel annotation.
> minor point: fragment has meaning in IPv4. For GSO, prefer gso_size.
Sure, will change it to gso_size.
>
>> Once the payload arrives into the kernel, the GSO on the interface would
>> instruct L3/L4 stack on fragmentation. In this case, the plaintext QUIC
>> packets should be aligned on the GSO marks less the tag size that would
>> be added by encryption. For GSO size 1000, the QUIC packets in the batch
>> for transmission should all be 984 bytes long, except maybe the last
>> one. Once the tag is attached, the new size of 1000 will correctly split
>> the QUIC packets further down the stack for transmission in individual
>> IP/UDP packets. The code is also saving processing time by sending all
>> packets at once to UDP in a single call, when GSO is enabled.
>>> I'm not sure if it's worth adding a ULP layer over UDP for this QUIC
>>> TX only. Honestly, I'm more supporting doing a full QUIC stack in the
>>> kernel independently with socket APIs to use it:
>>> https://github.com/lxin/tls_hs.
>>>
>>> Thanks.

Willem de Bruijn Sept. 27, 2022, 5:12 p.m. UTC | #6

On Tue, Sep 27, 2022 at 12:45 PM Adel Abouchaev <adel.abushaev@gmail.com> wrote:
>
>
> On 9/25/22 11:04 AM, Willem de Bruijn wrote:
> >>> The patch seems to get the crypto_ctx by doing a connection hash table
> >>> lookup in the sendmsg(), which is not good from the performance side.
> >>> One QUIC connection can go over multiple UDP sockets, but I don't
> >>> think one socket can be used by multiple QUIC connections. So why not
> >>> save the ctx in the socket instead?
> >> A single socket could have multiple connections originated from it,
> >> having different destinations, if the socket is not connected. An
> >> optimization could be made for connected sockets to cache the context
> >> and save time on a lookup. The measurement of kernel operations timing
> >> did not reveal a significant amount of time spent in this lookup due to
> >> a relatively small number of connections per socket in general. A shared
> >> table across multiple sockets might experience a different performance
> >> grading.
> > I'm late to this patch series, sorry. High quality implementation. I
> > have a few design questions similar to Xin.
> >
> > If multiplexing, instead of looking up a connection by { address, port
> > variable length connection ID }, perhaps return a connection table
> > index on setsockopt and use that in sendmsg.
>
>
> It was deliberate to not to return anything other than 0 from
> setsockopt() as defined in the spec for the function. Despite that it
> says "shall", the doc says that 0 is the only value for successful
> operation. This was the reason not to use setsockopt() for any
> bidirectional transfers of data and or status. A more sophisticated
> approach with netlink sockets would be more suitable for it. The second
> reason is the API asymmetry for Tx and Rx which will be introduced - the

I thought the cover letter indicated that due to asymmetry of most
QUIC workloads, only Tx offload is implemented. You do intend to add
Rx later?

> Rx will still need to match on the address, port and cid. The third
> reason is that in current implementations there are no more than a few
> connections per socket,

This is very different from how we do QUIC at Google. There we
definitely multiplex many connections across essentially a socket per
CPU IIRC.

> which does not abuse the rhashtable that does a
> lookup, although it takes time to hash the key into a hash for a seek.
> The performance measurement ran against the runtime and did not flag
> this path as underperforming either, there were other parts that
> substantially add to the runtime, not the key lookup though.
>
>
> >>> The patch is to reduce the copying operations between user space and
> >>> the kernel. I might miss something in your user space code, but the
> >>> msg to send is *already packed* into the Stream Frame in user space,
> >>> what's the difference if you encrypt it in userspace and then
> >>> sendmsg(udp_sk) with zero-copy to the kernel.
> >> It is possible to do it this way. Zero-copy works best with packet sizes
> >> starting at 32K and larger.  Anything less than that would consume the
> >> improvements of zero-copy by zero-copy pre/post operations and needs to
> >> align memory.
> > Part of the cost of MSG_ZEROCOPY is in mapping and unmapping user
> > pages. This series re-implements that with its own get_user_pages.
> > That is duplicative non-trivial code. And it will incur the same cost.
> > What this implementation saves is the (indeed non-trivial)
> > asynchronous completion notification over the error queue.
> >
> > The cover letter gives some performance numbers against a userspace
> > implementation that has to copy from user to kernel. It might be more
> > even to compare against an implementation using MSG_ZEROCOPY and
> > UDP_SEGMENT. A userspace crypto implementation may have other benefits
> > compared to a kernel implementation, such as not having to convert to
> > crypto API scatter-gather arrays and back to network structures.
> >
> > A few related points
> >
> > - The implementation support multiplexed connections, but only one
> > crypto sendmsg can be outstanding at any time:
> >
> >    + /**
> >    + * To synchronize concurrent sendmsg() requests through the same socket
> >    + * and protect preallocated per-context memory.
> >    + **/
> >    + struct mutex sendmsg_mux;
> >
> > That is quite limiting for production workloads.
>
> The use case that we have with MVFST library currently runs a single
> worker for a connection and has a single socket attached to it. QUIC
> allows simultaneous use of multiple connection IDs to swap them in
> runtime, and implementation would request only a handful of these. The
> MVFST batches writes into a block of about 8Kb and then uses GSO to send
> them all at once.
>
> > - Crypto operations are also executed synchronously, using
> > crypto_wait_req after each operationn. This limits throughput by using
> > at most one core per UDP socket. And adds sendmsg latency (which may
> > or may not be important to the application). Wireguard shows an
> > example of how to parallelize software crypto across cores.
> >
> > - The implementation avoids dynamic allocation of cipher text pages by
> > using a single ctx->cipher_page. This is protected by sendmsg_mux (see
> > above). Is that safe when packets leave the protocol stack and are
> > then held in a qdisc or when being processed by the NIC?
> > quic_sendmsg_locked will return, but the cipher page is not free to
> > reuse yet.
> There is currently no use case that we have in hands that requires
> parallel transmission of data for the same connection. Multiple
> connections would have no issue running in parallel as each of them will
> have it's own preallocated cipher_page in the context.

This still leaves the point that sendmsg may return and the mutex
released while the cipher_page is still associated with an skb in the
transmit path.

> There is a fragmentation further down the stack with
> ip_generic_getfrag() that eventually does copy_from_iter() and makea a
> copy of the data. This is executed as part of __ip_append_data() called
> from udp_sendmsg() in ipv4/udp.c. The assumption was that this is
> executed synchronously and the queues and NIC will see a mapping of a
> different memory area than the ciphertext in the pre-allocated page.
>
> >
> > - The real benefit of kernel QUIC will come from HW offload. Would it
> > be better to avoid the complexity of an in-kernel software
> > implementation and only focus on HW offload? Basically, pass the
> > plaintext QUIC packets over a standard UDP socket and alongside in a
> > cmsg pass either an index into a HW security association database or
> > the immediate { key, iv } connection_info (for stateless sockets), to
> > be encoded into the descriptor by the device driver.
> Hardware usually targets a single ciphersuite such as AES-GCM-128/256,
> while QUIC also supports Chacha20-Poly1305 and AES-CCM. The generalized
> support for offload prompted implementation of these ciphers in kernel
> code.

All userspace libraries also support all protocols as fall-back. No
need for two fall-backs if HW support is missing?

> The kernel code could also engage if the future hardware has
> capacity caps preventing it from handling all requests in the hardware.
> > - With such a simpler path, could we avoid introducing ULP and just
> > have udp [gs]etsockopt CRYPTO_STATE. Where QUIC is the only defined
> > state type yet.
> >
> > - Small aside: as the series introduces new APIs with non-trivial
> > parsing in the kernel, it's good to run a fuzzer like syzkaller on it
> > (if not having done so yet).
> Agreed.
> >> The other possible obstacle would be that eventual support
> >> of QUIC encryption and decryption in hardware would integrate well with
> >> this current approach.
> >>> Didn't really understand the "GSO" you mentioned, as I don't see any
> >>> code about kernel GSO, I guess it's just "Fragment size", right?
> >>> BTW, it‘s not common to use "//" for the kernel annotation.
> > minor point: fragment has meaning in IPv4. For GSO, prefer gso_size.
> Sure, will change it to gso_size.
> >
> >> Once the payload arrives into the kernel, the GSO on the interface would
> >> instruct L3/L4 stack on fragmentation. In this case, the plaintext QUIC
> >> packets should be aligned on the GSO marks less the tag size that would
> >> be added by encryption. For GSO size 1000, the QUIC packets in the batch
> >> for transmission should all be 984 bytes long, except maybe the last
> >> one. Once the tag is attached, the new size of 1000 will correctly split
> >> the QUIC packets further down the stack for transmission in individual
> >> IP/UDP packets. The code is also saving processing time by sending all
> >> packets at once to UDP in a single call, when GSO is enabled.
> >>> I'm not sure if it's worth adding a ULP layer over UDP for this QUIC
> >>> TX only. Honestly, I'm more supporting doing a full QUIC stack in the
> >>> kernel independently with socket APIs to use it:
> >>> https://github.com/lxin/tls_hs.
> >>>
> >>> Thanks.

Adel Abouchaev Sept. 27, 2022, 5:28 p.m. UTC | #7

On 9/27/22 10:12 AM, Willem de Bruijn wrote:
> On Tue, Sep 27, 2022 at 12:45 PM Adel Abouchaev <adel.abushaev@gmail.com> wrote:
>>
>> On 9/25/22 11:04 AM, Willem de Bruijn wrote:
>>>>> The patch seems to get the crypto_ctx by doing a connection hash table
>>>>> lookup in the sendmsg(), which is not good from the performance side.
>>>>> One QUIC connection can go over multiple UDP sockets, but I don't
>>>>> think one socket can be used by multiple QUIC connections. So why not
>>>>> save the ctx in the socket instead?
>>>> A single socket could have multiple connections originated from it,
>>>> having different destinations, if the socket is not connected. An
>>>> optimization could be made for connected sockets to cache the context
>>>> and save time on a lookup. The measurement of kernel operations timing
>>>> did not reveal a significant amount of time spent in this lookup due to
>>>> a relatively small number of connections per socket in general. A shared
>>>> table across multiple sockets might experience a different performance
>>>> grading.
>>> I'm late to this patch series, sorry. High quality implementation. I
>>> have a few design questions similar to Xin.
>>>
>>> If multiplexing, instead of looking up a connection by { address, port
>>> variable length connection ID }, perhaps return a connection table
>>> index on setsockopt and use that in sendmsg.
>>
>> It was deliberate to not to return anything other than 0 from
>> setsockopt() as defined in the spec for the function. Despite that it
>> says "shall", the doc says that 0 is the only value for successful
>> operation. This was the reason not to use setsockopt() for any
>> bidirectional transfers of data and or status. A more sophisticated
>> approach with netlink sockets would be more suitable for it. The second
>> reason is the API asymmetry for Tx and Rx which will be introduced - the
> I thought the cover letter indicated that due to asymmetry of most
> QUIC workloads, only Tx offload is implemented. You do intend to add
> Rx later?
We are planning to include Rx later as well, Tx is more compelling from 
use case perspective as the main application of this would go to an 
Edge, where Tx is heavy and Rx is much lighter.
>> Rx will still need to match on the address, port and cid. The third
>> reason is that in current implementations there are no more than a few
>> connections per socket,
> This is very different from how we do QUIC at Google. There we
> definitely multiplex many connections across essentially a socket per
> CPU IIRC.
Ian mentioned that for such a case, while using a zero length QUIC CID, 
the remote ends are ephemeral ports and are different. In which case 
each connection will have its own context and yes, it will load the hash 
table quite a bit for a socket. The limiting factor is still a BSD4.4 
interface and returning a value from setsockopt() with the entry ID. It 
might be more suitable once a standard netlink interface is available 
for this to easily plug in. However, this will further complicate a 
userspace library integration which is a not so trivial task today too.
>
>> which does not abuse the rhashtable that does a
>> lookup, although it takes time to hash the key into a hash for a seek.
>> The performance measurement ran against the runtime and did not flag
>> this path as underperforming either, there were other parts that
>> substantially add to the runtime, not the key lookup though.
>>
>>
>>>>> The patch is to reduce the copying operations between user space and
>>>>> the kernel. I might miss something in your user space code, but the
>>>>> msg to send is *already packed* into the Stream Frame in user space,
>>>>> what's the difference if you encrypt it in userspace and then
>>>>> sendmsg(udp_sk) with zero-copy to the kernel.
>>>> It is possible to do it this way. Zero-copy works best with packet sizes
>>>> starting at 32K and larger.  Anything less than that would consume the
>>>> improvements of zero-copy by zero-copy pre/post operations and needs to
>>>> align memory.
>>> Part of the cost of MSG_ZEROCOPY is in mapping and unmapping user
>>> pages. This series re-implements that with its own get_user_pages.
>>> That is duplicative non-trivial code. And it will incur the same cost.
>>> What this implementation saves is the (indeed non-trivial)
>>> asynchronous completion notification over the error queue.
>>>
>>> The cover letter gives some performance numbers against a userspace
>>> implementation that has to copy from user to kernel. It might be more
>>> even to compare against an implementation using MSG_ZEROCOPY and
>>> UDP_SEGMENT. A userspace crypto implementation may have other benefits
>>> compared to a kernel implementation, such as not having to convert to
>>> crypto API scatter-gather arrays and back to network structures.
>>>
>>> A few related points
>>>
>>> - The implementation support multiplexed connections, but only one
>>> crypto sendmsg can be outstanding at any time:
>>>
>>>     + /**
>>>     + * To synchronize concurrent sendmsg() requests through the same socket
>>>     + * and protect preallocated per-context memory.
>>>     + **/
>>>     + struct mutex sendmsg_mux;
>>>
>>> That is quite limiting for production workloads.
>> The use case that we have with MVFST library currently runs a single
>> worker for a connection and has a single socket attached to it. QUIC
>> allows simultaneous use of multiple connection IDs to swap them in
>> runtime, and implementation would request only a handful of these. The
>> MVFST batches writes into a block of about 8Kb and then uses GSO to send
>> them all at once.
>>
>>> - Crypto operations are also executed synchronously, using
>>> crypto_wait_req after each operationn. This limits throughput by using
>>> at most one core per UDP socket. And adds sendmsg latency (which may
>>> or may not be important to the application). Wireguard shows an
>>> example of how to parallelize software crypto across cores.
>>>
>>> - The implementation avoids dynamic allocation of cipher text pages by
>>> using a single ctx->cipher_page. This is protected by sendmsg_mux (see
>>> above). Is that safe when packets leave the protocol stack and are
>>> then held in a qdisc or when being processed by the NIC?
>>> quic_sendmsg_locked will return, but the cipher page is not free to
>>> reuse yet.
>> There is currently no use case that we have in hands that requires
>> parallel transmission of data for the same connection. Multiple
>> connections would have no issue running in parallel as each of them will
>> have it's own preallocated cipher_page in the context.
> This still leaves the point that sendmsg may return and the mutex
> released while the cipher_page is still associated with an skb in the
> transmit path.
Correct, there is a further copy from the cipher buffer into fragmented 
pieces done by ip_generic_getfrag(). Am I reading it wrong that it would 
not need the cipher text allocated buffer after that is done and all the 
data is copied from it into further structures using copy_from_iter()? 
The skb would be built upon a different buffer space than the encrypted 
data. The udp_sendmsg() assumes that it receives the memory from 
userspace and does all these ops to move stuff from userspace. While 
doing that, it would omit tons of it's checks as the memory is already 
in kernel and still execute quickly.
>
>> There is a fragmentation further down the stack with
>> ip_generic_getfrag() that eventually does copy_from_iter() and makea a
>> copy of the data. This is executed as part of __ip_append_data() called
>> from udp_sendmsg() in ipv4/udp.c. The assumption was that this is
>> executed synchronously and the queues and NIC will see a mapping of a
>> different memory area than the ciphertext in the pre-allocated page.
>>
>>> - The real benefit of kernel QUIC will come from HW offload. Would it
>>> be better to avoid the complexity of an in-kernel software
>>> implementation and only focus on HW offload? Basically, pass the
>>> plaintext QUIC packets over a standard UDP socket and alongside in a
>>> cmsg pass either an index into a HW security association database or
>>> the immediate { key, iv } connection_info (for stateless sockets), to
>>> be encoded into the descriptor by the device driver.
>> Hardware usually targets a single ciphersuite such as AES-GCM-128/256,
>> while QUIC also supports Chacha20-Poly1305 and AES-CCM. The generalized
>> support for offload prompted implementation of these ciphers in kernel
>> code.
> All userspace libraries also support all protocols as fall-back. No
> need for two fall-backs if HW support is missing?
Could be done that way. Looking at TLS - it has fallback to kernel 
though in tls_sw, although that is a dual purpose code.
>> The kernel code could also engage if the future hardware has
>> capacity caps preventing it from handling all requests in the hardware.
>>> - With such a simpler path, could we avoid introducing ULP and just
>>> have udp [gs]etsockopt CRYPTO_STATE. Where QUIC is the only defined
>>> state type yet.
>>>
>>> - Small aside: as the series introduces new APIs with non-trivial
>>> parsing in the kernel, it's good to run a fuzzer like syzkaller on it
>>> (if not having done so yet).
>> Agreed.
>>>> The other possible obstacle would be that eventual support
>>>> of QUIC encryption and decryption in hardware would integrate well with
>>>> this current approach.
>>>>> Didn't really understand the "GSO" you mentioned, as I don't see any
>>>>> code about kernel GSO, I guess it's just "Fragment size", right?
>>>>> BTW, it‘s not common to use "//" for the kernel annotation.
>>> minor point: fragment has meaning in IPv4. For GSO, prefer gso_size.
>> Sure, will change it to gso_size.
>>>> Once the payload arrives into the kernel, the GSO on the interface would
>>>> instruct L3/L4 stack on fragmentation. In this case, the plaintext QUIC
>>>> packets should be aligned on the GSO marks less the tag size that would
>>>> be added by encryption. For GSO size 1000, the QUIC packets in the batch
>>>> for transmission should all be 984 bytes long, except maybe the last
>>>> one. Once the tag is attached, the new size of 1000 will correctly split
>>>> the QUIC packets further down the stack for transmission in individual
>>>> IP/UDP packets. The code is also saving processing time by sending all
>>>> packets at once to UDP in a single call, when GSO is enabled.
>>>>> I'm not sure if it's worth adding a ULP layer over UDP for this QUIC
>>>>> TX only. Honestly, I'm more supporting doing a full QUIC stack in the
>>>>> kernel independently with socket APIs to use it:
>>>>> https://github.com/lxin/tls_hs.
>>>>>
>>>>> Thanks.

[net-next,v2,0/6] net: support QUIC crypto

Message

Comments