
[v5,net-next,00/15] AccECN protocol patch series

Message ID 20250422153602.54787-1-chia-yu.chang@nokia-bell-labs.com

Message

Chia-Yu Chang (Nokia) April 22, 2025, 3:35 p.m. UTC
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>

Hello,

Please find the v5:

v5 (22-Apr-2025)
- Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms@kernel.org>)

v4 (18-Apr-2025)
- Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms@kernel.org>)

v3 (14-Apr-2025)
- Fix patch apply issue in v2 (Jakub Kicinski <kuba@kernel.org>)

v2 (18-Mar-2025)
- Add one missing patch from previous AccECN protocol preparation patch series to this patch series

The full patch series can be found in
https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/

The Accurate ECN draft can be found in
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28

Best regards,
Chia-Yu

Chia-Yu Chang (1):
  tcp: accecn: AccECN option failure handling

Ilpo Järvinen (14):
  tcp: reorganize SYN ECN code
  tcp: fast path functions later
  tcp: AccECN core
  tcp: accecn: AccECN negotiation
  tcp: accecn: add AccECN rx byte counters
  tcp: accecn: AccECN needs to know delivered bytes
  tcp: allow embedding leftover into option padding
  tcp: sack option handling improvements
  tcp: accecn: AccECN option
  tcp: accecn: AccECN option send control
  tcp: accecn: AccECN option ceb/cep heuristic
  tcp: accecn: AccECN ACE field multi-wrap heuristic
  tcp: accecn: try to fit AccECN option with SACK
  tcp: try to avoid safer when ACKs are thinned

 include/linux/tcp.h        |  27 +-
 include/net/netns/ipv4.h   |   2 +
 include/net/tcp.h          | 198 +++++++++++--
 include/uapi/linux/tcp.h   |   7 +
 net/ipv4/syncookies.c      |   3 +
 net/ipv4/sysctl_net_ipv4.c |  19 ++
 net/ipv4/tcp.c             |  26 +-
 net/ipv4/tcp_input.c       | 591 +++++++++++++++++++++++++++++++++++--
 net/ipv4/tcp_ipv4.c        |   5 +-
 net/ipv4/tcp_minisocks.c   |  92 +++++-
 net/ipv4/tcp_output.c      | 302 +++++++++++++++++--
 net/ipv6/syncookies.c      |   1 +
 net/ipv6/tcp_ipv6.c        |   1 +
 13 files changed, 1178 insertions(+), 96 deletions(-)

Comments

Simon Horman April 24, 2025, 5:36 p.m. UTC | #1
On Tue, Apr 22, 2025 at 05:35:47PM +0200, chia-yu.chang@nokia-bell-labs.com wrote:
> From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> 
> Hello,
> 
> Please find the v5:
> 
> v5 (22-Apr-2025)
> - Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms@kernel.org>)
> 
> v4 (18-Apr-2025)
> - Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms@kernel.org>)

Thanks, I confirm that v6 appears to be clear with respect to these
build checks for the 32-bit ARM alignment assertions.

...
Paolo Abeni April 29, 2025, 2:08 p.m. UTC | #2
On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> @@ -555,6 +556,30 @@ static void smc_check_reset_syn_req(const struct tcp_sock *oldtp,
>  #endif
>  }
>  
> +u8 tcp_accecn_option_init(const struct sk_buff *skb, u8 opt_offset)
> +{
> +	unsigned char *ptr = skb_transport_header(skb) + opt_offset;
> +	unsigned int optlen = ptr[1] - 2;
> +
> +	WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1);

This warn should be dropped, too.

/P
Chia-Yu Chang (Nokia) May 5, 2025, 3:24 p.m. UTC | #3
> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com> 
> Sent: Tuesday, April 29, 2025 12:14 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com>
> Cc: Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com>
> Subject: Re: [PATCH v5 net-next 03/15] tcp: AccECN core
> 
> 
> CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information.
> 
> 
> 
> On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> > @@ -298,6 +298,9 @@ struct tcp_sock {
> >       u32     snd_up;         /* Urgent pointer               */
> >       u32     delivered;      /* Total data packets delivered incl. rexmits */
> >       u32     delivered_ce;   /* Like the above but only ECE marked packets */
> > +     u32     received_ce;    /* Like the above but for rcvd CE marked pkts */
> > +     u8      received_ce_pending:4, /* Not yet transmit cnt of received_ce */
> > +             unused2:4;
> 
> AFAICS this uses a 4 bytes hole present prior to this patch after "rcv_wnd", leaving a 3 bytes hole after 'unused2'. Possibly should be worth mentioning the hole presence.
> 
> @Eric: would it make sense use this hole for 'noneagle'/'rate_app_limited' and shrink the 'tcp_sock_write_txrx' group a bit?

Hi Paolo,

	Thanks for the feedback, and sorry for my late response.
	I can either mention it here or move the fields around.
	However, as the following patches will continue to change the holes, mentioning the hole change per patch may make it more understandable.
	If this is fine for you, I will make the revision in the next version.

> 
> [...]
> > @@ -5095,7 +5097,7 @@ static void __init tcp_struct_check(void)
> >       /* 32bit arches with 8byte alignment on u64 fields might need padding
> >        * before tcp_clock_cache.
> >        */
> > -     CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 92 + 4);
> > +     CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, 
> > + tcp_sock_write_txrx, 97 + 7);
> 
> Really? I *think* the change here should not move the cacheline end around, due to holes. Could you please include the relevant pahole
> (trimmed) output prior to this patch and after in the commit message?
> 

Here is the pahole output before and after this patch.
Indeed, it creates a 3-byte hole after 'unused2', so we shall add (5 + 3) = 8 to the original 92 + 4.
Finally, it will be 92 + 4 + (5 + 3) = 97 + 7.

*BEFORE this patch*
    __u8                       __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2585     0 */

    /* XXX 3 bytes hole, try to pack */

    __be32                     pred_flags;           /*  2588     4 */
    u64                        tcp_clock_cache;      /*  2592     8 */
    u64                        tcp_mstamp;           /*  2600     8 */
    u32                        rcv_nxt;              /*  2608     4 */
    u32                        snd_nxt;              /*  2612     4 */
    u32                        snd_una;              /*  2616     4 */
    u32                        window_clamp;         /*  2620     4 */
    /* --- cacheline 41 boundary (2624 bytes) --- */
    u32                        srtt_us;              /*  2624     4 */
    u32                        packets_out;          /*  2628     4 */
    u32                        snd_up;               /*  2632     4 */
    u32                        delivered;            /*  2636     4 */
    u32                        delivered_ce;         /*  2640     4 */
    u32                        app_limited;          /*  2644     4 */
    u32                        rcv_wnd;              /*  2648     4 */
    struct tcp_options_received rx_opt;              /*  2652    24 */
    u8                         nonagle:4;            /*  2676: 0  1 */
    u8                         rate_app_limited:1;   /*  2676: 4  1 */

    /* XXX 3 bits hole, try to pack */

    __u8                       __cacheline_group_end__tcp_sock_write_txrx[0]; /*  2677     0 */

    /* XXX 3 bytes hole, try to pack */

*AFTER this patch*
    __u8                       __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2585     0 */

    /* XXX 3 bytes hole, try to pack */              
    
    __be32                     pred_flags;           /*  2588     4 */
    u64                        tcp_clock_cache;      /*  2592     8 */
    u64                        tcp_mstamp;           /*  2600     8 */
    u32                        rcv_nxt;              /*  2608     4 */
    u32                        snd_nxt;              /*  2612     4 */
    u32                        snd_una;              /*  2616     4 */
    u32                        window_clamp;         /*  2620     4 */
    /* --- cacheline 41 boundary (2624 bytes) --- */
    u32                        srtt_us;              /*  2624     4 */
    u32                        packets_out;          /*  2628     4 */
    u32                        snd_up;               /*  2632     4 */
    u32                        delivered;            /*  2636     4 */
    u32                        delivered_ce;         /*  2640     4 */ 
    u32                        received_ce;          /*  2644     4 */
    u8                         received_ce_pending:4; /*  2648: 0  1 */
    u8                         unused2:4;            /*  2648: 4  1 */

    /* XXX 3 bytes hole, try to pack */

    u32                        app_limited;          /*  2652     4 */
    u32                        rcv_wnd;              /*  2656     4 */
    struct tcp_options_received rx_opt;              /*  2660    24 */
    u8                         nonagle:4;            /*  2684: 0  1 */
    u8                         rate_app_limited:1;   /*  2684: 4  1 */

    /* XXX 3 bits hole, try to pack */

    __u8                       __cacheline_group_end__tcp_sock_write_txrx[0]; /*  2685     0 */
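For reference, the hole arithmetic above can be reproduced with a reduced stand-alone struct. This is only a sketch: the field names mirror the pahole output, but the struct below is not the kernel's tcp_sock definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced sketch of the tcp_sock tail discussed above. The two new
 * fields add 4 + 1 = 5 bytes, and the u32 that follows the 4+4
 * bitfield byte forces 3 bytes of padding, hence
 * 92 + 4 + (5 + 3) = 97 + 7.
 */
struct txrx_tail {
	uint32_t delivered_ce;
	uint32_t received_ce;            /* new: 4 bytes */
	uint8_t  received_ce_pending:4,  /* new: 1 byte for both fields */
		 unused2:4;
	/* XXX 3 bytes hole here, as pahole reports */
	uint32_t app_limited;
	uint32_t rcv_wnd;
};

/* Padding between the bitfield byte and app_limited. */
static size_t hole_after_bitfield(void)
{
	return offsetof(struct txrx_tail, app_limited) -
	       (offsetof(struct txrx_tail, received_ce) + sizeof(uint32_t) + 1);
}
```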


> [...]
> > @@ -384,17 +387,16 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
> >               if (tcp_ca_needs_ecn(sk))
> >                       tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
> >
> > -             if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
> > +             if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR) &&
> > +                 tcp_ecn_mode_rfc3168(tp)) {
> >                       /* Better not delay acks, sender can have a very low cwnd */
> >                       tcp_enter_quickack_mode(sk, 2);
> >                       tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
> >               }
> > -             tp->ecn_flags |= TCP_ECN_SEEN;
> 
> At this point is not entirely clear to me why the removal of the above line is needed/correct.

What I see is that this patch moves the place where this flag is set from here into tcp_ecn_received_counters().

Also, tcp_data_ecn_check() only runs when receiving data, i.e., when tcp_event_data_recv() is called.

In contrast, tcp_ecn_received_counters() takes effect in more places (whether len <= tcp_header_len or not), ensuring the ACE counter tracks all segments, including pure ACKs.
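The point can be sketched with a reduced, hypothetical helper (not the kernel implementation): CE counting that runs for every segment, pure ACKs included, rather than only in the data-receive path.

```c
#include <assert.h>
#include <stdint.h>

#define INET_ECN_CE 3

/* Hypothetical reduced state: received_ce must advance for every
 * CE-marked segment, including pure ACKs (payload_len == 0), which
 * is why the counting moved out of the data-receive-only path.
 */
struct sk_state {
	uint32_t received_ce;
};

static void ecn_received_counters(struct sk_state *tp, uint8_t ecnfield,
				  int payload_len)
{
	(void)payload_len; /* CE counting is independent of payload size */
	if (ecnfield == INET_ECN_CE)
		tp->received_ce++;
}

/* Feed a pure ACK and two data segments; only CE marks count. */
static uint32_t demo_sequence(void)
{
	struct sk_state tp = { .received_ce = 0 };

	ecn_received_counters(&tp, INET_ECN_CE, 0);    /* pure ACK, CE */
	ecn_received_counters(&tp, 0, 1400);           /* data, Not-ECT */
	ecn_received_counters(&tp, INET_ECN_CE, 1400); /* data, CE */
	return tp.received_ce;
}
```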

> [...]
> > @@ -4056,6 +4118,11 @@ static int tcp_ack(struct sock *sk, const 
> > struct sk_buff *skb, int flag)
> >
> >       tcp_rack_update_reo_wnd(sk, &rs);
> >
> > +     if (tcp_ecn_mode_accecn(tp))
> > +             ecn_count = tcp_accecn_process(sk, skb,
> > +                                            tp->delivered - delivered,
> > +                                            &flag);
> 
> AFAICS the above could set FLAG_ECE in flags, meaning the previous
> tcp_clean_rtx_queue() will run with such flag cleared while the later functions checking such flag will not. I'm wondering if this inconsistency could cause problems?

The flag set by tcp_accecn_process() will be used by the following functions: tcp_in_ack_event() and tcp_fastretrans_alert().

And this shall only impact AccECN mode.

Best regards,
Chia-Yu

> 
> /P
Chia-Yu Chang (Nokia) May 5, 2025, 9:53 p.m. UTC | #4
> -----Original Message-----
> From: Paolo Abeni <pabeni@redhat.com> 
> Sent: Tuesday, April 29, 2025 2:10 PM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com>
> Subject: Re: [PATCH v5 net-next 10/15] tcp: accecn: AccECN option send control
> 
> 
> 
> 
> 
> On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> > From: Ilpo Järvinen <ij@kernel.org>
> >
> > Instead of sending the option in every ACK, limit sending to those 
> > ACKs where the option is necessary:
> > - Handshake
> > - "Change-triggered ACK" + the ACK following it. The
> >   2nd ACK is necessary to unambiguously indicate which
> >   of the ECN byte counters is increasing. The first
> >   ACK has two counters increasing due to the ecnfield
> >   edge.
> > - ACKs with CE to allow CEP delta validations to take
> >   advantage of the option.
> > - Force the option to be sent at least once per 2^22
> >   bytes. The check is done using the bit edges of the
> >   byte counters (avoids need for extra variables).
> > - AccECN option beacon to send a few times per RTT even if
> >   nothing in the ECN state requires that. The default is 3
> >   times per RTT, and its period can be set via
> >   sysctl_tcp_ecn_option_beacon.
> >
> > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > ---
> >  include/linux/tcp.h        |  3 +++
> >  include/net/netns/ipv4.h   |  1 +
> >  include/net/tcp.h          |  1 +
> >  net/ipv4/sysctl_net_ipv4.c |  9 ++++++++
> >  net/ipv4/tcp.c             |  5 ++++-
> >  net/ipv4/tcp_input.c       | 36 +++++++++++++++++++++++++++++++-
> >  net/ipv4/tcp_ipv4.c        |  1 +
> >  net/ipv4/tcp_minisocks.c   |  2 ++
> >  net/ipv4/tcp_output.c      | 42 ++++++++++++++++++++++++++++++--------
> >  9 files changed, 90 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 
> > 0e032d9631ac..acb0727855f8 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -309,8 +309,11 @@ struct tcp_sock {
> >       u8      received_ce_pending:4, /* Not yet transmit cnt of received_ce */
> >               unused2:4;
> >       u8      accecn_minlen:2,/* Minimum length of AccECN option sent */
> > +             prev_ecnfield:2,/* ECN bits from the previous segment */
> > +             accecn_opt_demand:2,/* Demand AccECN option for n next 
> > + ACKs */
> >               est_ecnfield:2;/* ECN field for AccECN delivered estimates */
> >       u32     app_limited;    /* limited until "delivered" reaches this val */
> > +     u64     accecn_opt_tstamp;      /* Last AccECN option sent timestamp */
> 
> AFAICS this field is only accessed in the tx path, while this chunk belongs to the tcp_sock_write_txrx group.
> 
> > @@ -740,6 +740,15 @@ static struct ctl_table ipv4_net_table[] = {
> >               .extra1         = SYSCTL_ZERO,
> >               .extra2         = SYSCTL_TWO,
> >       },
> > +     {
> > +             .procname       = "tcp_ecn_option_beacon",
> > +             .data           = &init_net.ipv4.sysctl_tcp_ecn_option_beacon,
> > +             .maxlen         = sizeof(u8),
> > +             .mode           = 0644,
> > +             .proc_handler   = proc_dou8vec_minmax,
> > +             .extra1         = SYSCTL_ZERO,
> > +             .extra2         = SYSCTL_FOUR,
> 
> The number of new sysctls is concerningly high, and I don't see any documentation update yet.

Hi Paolo,

The documentation is expected to come at the end of the whole AccECN patch series:
https://github.com/L4STeam/linux-net-next/commit/03dcec1aec6aa774da4c1993b38a5b937040a11c

Or I can move it next to this patch.

> 
> > @@ -6291,9 +6294,36 @@ void tcp_ecn_received_counters(struct sock *sk, 
> > const struct sk_buff *skb,
> >
> >               if (payload_len > 0) {
> >                       u8 minlen = 
> > tcp_ecnfield_to_accecn_optfield(ecnfield);
> > +                     u32 oldbytes = tp->received_ecn_bytes[ecnfield - 
> > + 1];
> > +
> >                       tp->received_ecn_bytes[ecnfield - 1] += payload_len;
> >                       tp->accecn_minlen = max_t(u8, tp->accecn_minlen,
> >                                                 minlen);
> > +
> > +                     /* Demand AccECN option at least every 2^22 bytes to
> > +                      * avoid overflowing the ECN byte counters.
> > +                      */
> > +                     if ((tp->received_ecn_bytes[ecnfield - 1] ^ oldbytes) &
> > +                         ~((1 << 22) - 1)) {
> > +                             u8 opt_demand = max_t(u8, 1,
> > +                                                   
> > + tp->accecn_opt_demand);
> > +
> > +                             tp->accecn_opt_demand = opt_demand;
> > +                     }
> 
> I guess this explains the u32 values for such counters. Some comments in the previous patch could be useful.

Yes, as my previous email says, I would refer to the algorithm in the AccECN draft.
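As a stand-alone illustration of the quoted guard (the helper name below is hypothetical, not kernel code): XORing the old and new byte counters and masking off the low 22 bits detects a 2^22-byte boundary crossing without any extra state variable, because a bit at or above bit 22 differs exactly when the upper part of the counter changed.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the overflow guard from the patch: demand a fresh AccECN
 * option whenever received_ecn_bytes crosses a 2^22-byte boundary.
 * The XOR of old and new values has a bit set at or above bit 22 iff
 * the upper bits changed, i.e. a boundary was crossed.
 */
static int crossed_2p22(uint32_t oldbytes, uint32_t newbytes)
{
	return ((newbytes ^ oldbytes) & ~((1u << 22) - 1)) != 0;
}
```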

> 
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 
> > 3f3e285fc973..2e95dad66fe3 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -3451,6 +3451,7 @@ static int __net_init tcp_sk_init(struct net 
> > *net)  {
> >       net->ipv4.sysctl_tcp_ecn = 2;
> >       net->ipv4.sysctl_tcp_ecn_option = 2;
> > +     net->ipv4.sysctl_tcp_ecn_option_beacon = 3;
> >       net->ipv4.sysctl_tcp_ecn_fallback = 1;
> 
> Human readable macros instead of magic numbers could help.

OK, comments will be added here.

> 
> > @@ -1237,13 +1253,18 @@ static unsigned int 
> > tcp_established_options(struct sock *sk, struct sk_buff *skb
> >
> >       if (tcp_ecn_mode_accecn(tp) &&
> >           sock_net(sk)->ipv4.sysctl_tcp_ecn_option) {
> > -             int saving = opts->num_sack_blocks > 0 ? 2 : 0;
> > -             int remaining = MAX_TCP_OPTION_SPACE - size;
> > -
> > -             opts->ecn_bytes = tp->received_ecn_bytes;
> > -             size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
> > -                                            remaining,
> > -                                            saving);
> > +             if (sock_net(sk)->ipv4.sysctl_tcp_ecn_option >= 2 ||
> > +                 tp->accecn_opt_demand ||
> > +                 tcp_accecn_option_beacon_check(sk)) {
> 
> Why a nested if here and just not expanding the existing one?

Sure, will merge them.

Chia-Yu

> 
> /P
Ilpo Järvinen May 5, 2025, 11:09 p.m. UTC | #5
On Tue, 29 Apr 2025, Paolo Abeni wrote:

> On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> > @@ -709,6 +709,8 @@ static __be32 *process_tcp_ao_options(struct tcp_sock *tp,
> >  	return ptr;
> >  }
> >  
> > +#define NOP_LEFTOVER	((TCPOPT_NOP << 8) | TCPOPT_NOP)
> > +
> >  /* Write previously computed TCP options to the packet.
> >   *
> >   * Beware: Something in the Internet is very sensitive to the ordering of
> > @@ -727,8 +729,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> >  			      struct tcp_out_options *opts,
> >  			      struct tcp_key *key)
> >  {
> > +	u16 leftover_bytes = NOP_LEFTOVER;      /* replace next NOPs if avail */
> >  	__be32 *ptr = (__be32 *)(th + 1);
> >  	u16 options = opts->options;	/* mungable copy */
> > +	int leftover_size = 2;
> >  
> >  	if (tcp_key_is_md5(key)) {
> >  		*ptr++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
> > @@ -763,17 +767,22 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> >  	}
> >  
> >  	if (unlikely(OPTION_SACK_ADVERTISE & options)) {
> > -		*ptr++ = htonl((TCPOPT_NOP << 24) |
> > -			       (TCPOPT_NOP << 16) |
> > +		*ptr++ = htonl((leftover_bytes << 16) |
> >  			       (TCPOPT_SACK_PERM << 8) |
> >  			       TCPOLEN_SACK_PERM);
> > +		leftover_bytes = NOP_LEFTOVER;
> 
> Why? isn't leftover_bytes already == NOP_LEFTOVER?
> 
> >  	}
> >  
> >  	if (unlikely(OPTION_WSCALE & options)) {
> > -		*ptr++ = htonl((TCPOPT_NOP << 24) |
> > +		u8 highbyte = TCPOPT_NOP;
> > +
> > +		if (unlikely(leftover_size == 1))
> 
> How can the above conditional be true?
> 
> > +			highbyte = leftover_bytes >> 8;
> > +		*ptr++ = htonl((highbyte << 24) |
> >  			       (TCPOPT_WINDOW << 16) |
> >  			       (TCPOLEN_WINDOW << 8) |
> >  			       opts->ws);
> > +		leftover_bytes = NOP_LEFTOVER;
> 
> Why? isn't leftover_bytes already == NOP_LEFTOVER?
> 
> >  	}
> >  
> >  	if (unlikely(opts->num_sack_blocks)) {
> > @@ -781,8 +790,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> >  			tp->duplicate_sack : tp->selective_acks;
> >  		int this_sack;
> >  
> > -		*ptr++ = htonl((TCPOPT_NOP  << 24) |
> > -			       (TCPOPT_NOP  << 16) |
> > +		*ptr++ = htonl((leftover_bytes << 16) |
> >  			       (TCPOPT_SACK <<  8) |
> >  			       (TCPOLEN_SACK_BASE + (opts->num_sack_blocks *
> >  						     TCPOLEN_SACK_PERBLOCK)));
> > @@ -794,6 +802,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> >  		}
> >  
> >  		tp->rx_opt.dsack = 0;
> > +	} else if (unlikely(leftover_bytes != NOP_LEFTOVER)) {
> 
> I really feel like I'm missing some code chunk, but I don't see any
> possible value for leftover_bytes other than NOP_LEFTOVER

Hi,
   
I split it this way to keep the generic part of the leftover mechanism in
its own patch, separate from the AccECN option change itself (as you did
later discover). This is among the most convoluted parts in the entire
AccECN patch series, so it seemed wise to split it as small as I could.
Having those nonsensical-looking assignments here was to avoid code churn
in the latter patch. The changelog could have stated that clearly, though
(my fault from years back).
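For readers following along, the leftover mechanism being discussed can be sketched in isolation. This is a simplified sketch showing only the SACK-permitted word: the option constants mirror the kernel's, but the helper itself is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TCPOPT_NOP        1
#define TCPOPT_SACK_PERM  4
#define TCPOLEN_SACK_PERM 2
#define NOP_LEFTOVER ((TCPOPT_NOP << 8) | TCPOPT_NOP)

/* Sketch of the leftover idea from the patch: each 32-bit option word
 * normally starts with two NOP padding bytes; if another option left
 * two spare bytes behind ("leftover"), those bytes replace the NOPs
 * so the padding space is not wasted.
 */
static uint32_t sack_perm_word(uint16_t leftover_bytes)
{
	return ((uint32_t)leftover_bytes << 16) |
	       (TCPOPT_SACK_PERM << 8) |
	       TCPOLEN_SACK_PERM;
}
```

With no leftover pending, the word is identical to the pre-patch NOP/NOP/SACK_PERM/LEN layout, which is why the seemingly redundant `leftover_bytes = NOP_LEFTOVER;` assignments in the generic patch only become meaningful once the AccECN option patch starts producing real leftovers.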
Ilpo Järvinen May 5, 2025, 11:26 p.m. UTC | #6
On Mon, 5 May 2025, Chia-Yu Chang (Nokia) wrote:

> > -----Original Message-----
> > From: Paolo Abeni <pabeni@redhat.com> 
> > Sent: Tuesday, April 29, 2025 2:10 PM
> > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com>
> > Subject: Re: [PATCH v5 net-next 10/15] tcp: accecn: AccECN option send control
> > 
> > 
> > 
> > 
> > 
> > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> > > From: Ilpo Järvinen <ij@kernel.org>
> > >
> > > Instead of sending the option in every ACK, limit sending to those 
> > > ACKs where the option is necessary:
> > > - Handshake
> > > - "Change-triggered ACK" + the ACK following it. The
> > >   2nd ACK is necessary to unambiguously indicate which
> > >   of the ECN byte counters is increasing. The first
> > >   ACK has two counters increasing due to the ecnfield
> > >   edge.
> > > - ACKs with CE to allow CEP delta validations to take
> > >   advantage of the option.
> > > - Force the option to be sent at least once per 2^22
> > >   bytes. The check is done using the bit edges of the
> > >   byte counters (avoids need for extra variables).
> > > - AccECN option beacon to send a few times per RTT even if
> > >   nothing in the ECN state requires that. The default is 3
> > >   times per RTT, and its period can be set via
> > >   sysctl_tcp_ecn_option_beacon.
> > >
> > > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > > Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > > ---
> > >  include/linux/tcp.h        |  3 +++
> > >  include/net/netns/ipv4.h   |  1 +
> > >  include/net/tcp.h          |  1 +
> > >  net/ipv4/sysctl_net_ipv4.c |  9 ++++++++
> > >  net/ipv4/tcp.c             |  5 ++++-
> > >  net/ipv4/tcp_input.c       | 36 +++++++++++++++++++++++++++++++-
> > >  net/ipv4/tcp_ipv4.c        |  1 +
> > >  net/ipv4/tcp_minisocks.c   |  2 ++
> > >  net/ipv4/tcp_output.c      | 42 ++++++++++++++++++++++++++++++--------
> > >  9 files changed, 90 insertions(+), 10 deletions(-)

> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 
> > > 3f3e285fc973..2e95dad66fe3 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -3451,6 +3451,7 @@ static int __net_init tcp_sk_init(struct net 
> > > *net)  {
> > >       net->ipv4.sysctl_tcp_ecn = 2;
> > >       net->ipv4.sysctl_tcp_ecn_option = 2;
> > > +     net->ipv4.sysctl_tcp_ecn_option_beacon = 3;
> > >       net->ipv4.sysctl_tcp_ecn_fallback = 1;
> > 
> > Human readable macros instead of magic numbers could help.
> 
> OK, comments will be added here.

Hi,

Using named defines to replace literals would be more useful than comments 
(names can be grepped for, do not fall out-of-sync with code, etc.).
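As a sketch of the suggestion (the define names below are illustrative, not from the kernel tree), the magic numbers assigned in tcp_sk_init() could become greppable named constants:

```c
#include <assert.h>

/* Hypothetical named defines for the sysctl defaults discussed above;
 * the names are invented for illustration only.
 */
#define TCP_ECN_ON_REQUEST_DEFAULT	2 /* reflect ECN when peer requests */
#define TCP_ACCECN_OPTION_FULL		2 /* send AccECN option on every ACK */
#define TCP_ACCECN_BEACON_PER_RTT	3 /* AccECN option beacons per RTT */

static int default_ecn_option(void)
{
	return TCP_ACCECN_OPTION_FULL;
}

static int default_option_beacon(void)
{
	return TCP_ACCECN_BEACON_PER_RTT;
}
```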

> > > @@ -1237,13 +1253,18 @@ static unsigned int 
> > > tcp_established_options(struct sock *sk, struct sk_buff *skb
> > >
> > >       if (tcp_ecn_mode_accecn(tp) &&
> > >           sock_net(sk)->ipv4.sysctl_tcp_ecn_option) {
> > > -             int saving = opts->num_sack_blocks > 0 ? 2 : 0;
> > > -             int remaining = MAX_TCP_OPTION_SPACE - size;
> > > -
> > > -             opts->ecn_bytes = tp->received_ecn_bytes;
> > > -             size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
> > > -                                            remaining,
> > > -                                            saving);
> > > +             if (sock_net(sk)->ipv4.sysctl_tcp_ecn_option >= 2 ||
> > > +                 tp->accecn_opt_demand ||
> > > +                 tcp_accecn_option_beacon_check(sk)) {
> > 
> > Why a nested if here and just not expanding the existing one?
> 
> Sure, will merge them.

While I don't remember everything that well anymore, this might have been
to reduce code churn in some later patch, so it might be worth checking
first (that patch might even fall outside of this series now that these
are split into multiple chunks).
Chia-Yu Chang (Nokia) May 6, 2025, 8:50 a.m. UTC | #7
> -----Original Message-----
> From: Ilpo Järvinen <ij@kernel.org> 
> Sent: Tuesday, May 6, 2025 1:10 AM
> To: Paolo Abeni <pabeni@redhat.com>
> Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com>
> Subject: Re: [PATCH v5 net-next 07/15] tcp: allow embedding leftover into option padding
> 
> 
> 
> 
> 
> On Tue, 29 Apr 2025, Paolo Abeni wrote:
> 
> > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> > > @@ -709,6 +709,8 @@ static __be32 *process_tcp_ao_options(struct tcp_sock *tp,
> > >     return ptr;
> > >  }
> > >
> > > +#define NOP_LEFTOVER       ((TCPOPT_NOP << 8) | TCPOPT_NOP)
> > > +
> > >  /* Write previously computed TCP options to the packet.
> > >   *
> > >   * Beware: Something in the Internet is very sensitive to the 
> > > ordering of @@ -727,8 +729,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> > >                           struct tcp_out_options *opts,
> > >                           struct tcp_key *key)  {
> > > +   u16 leftover_bytes = NOP_LEFTOVER;      /* replace next NOPs if avail */
> > >     __be32 *ptr = (__be32 *)(th + 1);
> > >     u16 options = opts->options;    /* mungable copy */
> > > +   int leftover_size = 2;
> > >
> > >     if (tcp_key_is_md5(key)) {
> > >             *ptr++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) | 
> > > @@ -763,17 +767,22 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> > >     }
> > >
> > >     if (unlikely(OPTION_SACK_ADVERTISE & options)) {
> > > -           *ptr++ = htonl((TCPOPT_NOP << 24) |
> > > -                          (TCPOPT_NOP << 16) |
> > > +           *ptr++ = htonl((leftover_bytes << 16) |
> > >                            (TCPOPT_SACK_PERM << 8) |
> > >                            TCPOLEN_SACK_PERM);
> > > +           leftover_bytes = NOP_LEFTOVER;
> >
> > Why? isn't leftover_bytes already == NOP_LEFTOVER?
> >
> > >     }
> > >
> > >     if (unlikely(OPTION_WSCALE & options)) {
> > > -           *ptr++ = htonl((TCPOPT_NOP << 24) |
> > > +           u8 highbyte = TCPOPT_NOP;
> > > +
> > > +           if (unlikely(leftover_size == 1))
> >
> > How can the above conditional be true?
> >
> > > +                   highbyte = leftover_bytes >> 8;
> > > +           *ptr++ = htonl((highbyte << 24) |
> > >                            (TCPOPT_WINDOW << 16) |
> > >                            (TCPOLEN_WINDOW << 8) |
> > >                            opts->ws);
> > > +           leftover_bytes = NOP_LEFTOVER;
> >
> > Why? isn't leftover_bytes already == NOP_LEFTOVER?
> >
> > >     }
> > >
> > >     if (unlikely(opts->num_sack_blocks)) { @@ -781,8 +790,7 @@ 
> > > static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> > >                     tp->duplicate_sack : tp->selective_acks;
> > >             int this_sack;
> > >
> > > -           *ptr++ = htonl((TCPOPT_NOP  << 24) |
> > > -                          (TCPOPT_NOP  << 16) |
> > > +           *ptr++ = htonl((leftover_bytes << 16) |
> > >                            (TCPOPT_SACK <<  8) |
> > >                            (TCPOLEN_SACK_BASE + (opts->num_sack_blocks *
> > >                                                  
> > > TCPOLEN_SACK_PERBLOCK))); @@ -794,6 +802,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
> > >             }
> > >
> > >             tp->rx_opt.dsack = 0;
> > > +   } else if (unlikely(leftover_bytes != NOP_LEFTOVER)) {
> >
> > I really feel like I'm missing some code chunk, but I don't see any 
> > possible value for leftover_bytes other than NOP_LEFTOVER
> 
> Hi,
> 
> I split it this way to keep the generic part of the leftover mechanism in its own patch, separate from the AccECN option change itself (as you did later discover). This is among the most convoluted parts in the entire AccECN patch series, so it seemed wise to split it as small as I could. Having those nonsensical-looking assignments here was to avoid code churn in the latter patch. The changelog could have stated that clearly, though (my fault from years back).
> 
> 
> --
>  i.

Hi Ilpo,

Thanks for further clarifications, and I will add more comments in this patch.

Chia-Yu
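
The packing being discussed can be illustrated with a small sketch (my own illustration, not the kernel code): TCP options are written as 32-bit words, so when a preceding option leaves one spare byte, its high byte can take the slot where the first NOP padding byte would otherwise go, ahead of the window-scale option. The constants below mirror the kernel's values, but the function name is hypothetical.

```c
#include <stdint.h>

#define TCPOPT_NOP      1
#define TCPOPT_WINDOW   3
#define TCPOLEN_WINDOW  3

/* Hedged sketch of the leftover embedding: if exactly one leftover
 * byte remains, place it where the leading NOP would otherwise go. */
static uint32_t pack_wscale_word(uint16_t leftover_bytes, int leftover_size,
				 uint8_t ws)
{
	uint8_t highbyte = TCPOPT_NOP;

	if (leftover_size == 1)
		highbyte = leftover_bytes >> 8;
	/* host-order word; the kernel stores it with htonl() */
	return ((uint32_t)highbyte << 24) |
	       ((uint32_t)TCPOPT_WINDOW << 16) |
	       ((uint32_t)TCPOLEN_WINDOW << 8) | ws;
}
```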
Chia-Yu Chang (Nokia) May 6, 2025, 8:53 a.m. UTC | #8
> -----Original Message-----
> From: Ilpo Järvinen <ij@kernel.org> 
> Sent: Tuesday, May 6, 2025 1:27 AM
> To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>
> Cc: Paolo Abeni <pabeni@redhat.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com>
> Subject: RE: [PATCH v5 net-next 10/15] tcp: accecn: AccECN option send control
> 
> 
> 
> On Mon, 5 May 2025, Chia-Yu Chang (Nokia) wrote:
> 
> > > -----Original Message-----
> > > From: Paolo Abeni <pabeni@redhat.com>
> > > Sent: Tuesday, April 29, 2025 2:10 PM
> > > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; 
> > > horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; 
> > > bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; 
> > > jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; 
> > > xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; 
> > > edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; 
> > > ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; 
> > > linux-kselftest@vger.kernel.org; ij@kernel.org; 
> > > ncardwell@google.com; Koen De Schepper (Nokia) 
> > > <koen.de_schepper@nokia-bell-labs.com>; g.white 
> > > <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; 
> > > mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; 
> > > Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com>
> > > Subject: Re: [PATCH v5 net-next 10/15] tcp: accecn: AccECN option 
> > > send control
> > >
> > >
> > >
> > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> > > > From: Ilpo Järvinen <ij@kernel.org>
> > > >
> > > > Instead of sending the option in every ACK, limit sending to those 
> > > > ACKs where the option is necessary:
> > > > - Handshake
> > > > - "Change-triggered ACK" + the ACK following it. The
> > > >   2nd ACK is necessary to unambiguously indicate which
> > > >   of the ECN byte counters is increasing. The first
> > > >   ACK has two counters increasing due to the ecnfield
> > > >   edge.
> > > > - ACKs with CE to allow CEP delta validations to take
> > > >   advantage of the option.
> > > > - Force the option to be sent at least once per 2^22
> > > >   bytes. The check is done using the bit edges of the
> > > >   byte counters (avoids need for extra variables).
> > > > - AccECN option beacon to send a few times per RTT even if
> > > >   nothing in the ECN state requires that. The default is 3
> > > >   times per RTT, and its period can be set via
> > > >   sysctl_tcp_ecn_option_beacon.
> > > >
> > > > Signed-off-by: Ilpo Järvinen <ij@kernel.org>
> > > > Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > > > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com>
> > > > ---
> > > >  include/linux/tcp.h        |  3 +++
> > > >  include/net/netns/ipv4.h   |  1 +
> > > >  include/net/tcp.h          |  1 +
> > > >  net/ipv4/sysctl_net_ipv4.c |  9 ++++++++
> > > >  net/ipv4/tcp.c             |  5 ++++-
> > > >  net/ipv4/tcp_input.c       | 36 +++++++++++++++++++++++++++++++-
> > > >  net/ipv4/tcp_ipv4.c        |  1 +
> > > >  net/ipv4/tcp_minisocks.c   |  2 ++
> > > >  net/ipv4/tcp_output.c      | 42 ++++++++++++++++++++++++++++++--------
> > > >  9 files changed, 90 insertions(+), 10 deletions(-)
> 
> > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > index 3f3e285fc973..2e95dad66fe3 100644
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -3451,6 +3451,7 @@ static int __net_init tcp_sk_init(struct net *net)
> > > > {
> > > >       net->ipv4.sysctl_tcp_ecn = 2;
> > > >       net->ipv4.sysctl_tcp_ecn_option = 2;
> > > > +     net->ipv4.sysctl_tcp_ecn_option_beacon = 3;
> > > >       net->ipv4.sysctl_tcp_ecn_fallback = 1;
> > >
> > > Human readable macros instead of magic numbers could help.
> >
> > OK, comments will be added here.
> 
> Hi,
> 
> Using named defines to replace literals would be more useful than comments (names can be grepped for, do not fall out-of-sync with code, etc.).
> 
> > > > @@ -1237,13 +1253,18 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb
> > > >
> > > >       if (tcp_ecn_mode_accecn(tp) &&
> > > >           sock_net(sk)->ipv4.sysctl_tcp_ecn_option) {
> > > > -             int saving = opts->num_sack_blocks > 0 ? 2 : 0;
> > > > -             int remaining = MAX_TCP_OPTION_SPACE - size;
> > > > -
> > > > -             opts->ecn_bytes = tp->received_ecn_bytes;
> > > > -             size += tcp_options_fit_accecn(opts, tp->accecn_minlen,
> > > > -                                            remaining,
> > > > -                                            saving);
> > > > +             if (sock_net(sk)->ipv4.sysctl_tcp_ecn_option >= 2 ||
> > > > +                 tp->accecn_opt_demand ||
> > > > +                 tcp_accecn_option_beacon_check(sk)) {
> > >
> > > Why a nested if here and just not expanding the existing one?
> >
> > Sure, will merge them.
> 
> While I don't remember everything that well anymore, this might have been to reduce code churn in some later patch, so it might be worth checking that first (that patch might even fall outside of this series now that these are split into multiple chunks).
> 
> --
>  i.


Hi Ilpo,

Thanks for this raised point.
I've checked that the condition is changed in a later patch ("tcp: accecn: AccECN option failure handling"), but with a similar nested if.
So I will try to merge them in the next version and will verify with packetdrill.

Chia-Yu
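
As an aside, the "bit edges of the byte counters" idea from the changelog quoted above can be sketched as follows (my own illustration under stated assumptions, not the kernel code): watching whether bit 22 of the running byte counter flips between the old and new value detects a 2^22-byte boundary crossing without a dedicated "bytes since last option" variable.

```c
#include <stdint.h>

/* Illustrative only: report a 2^22-byte boundary crossing by checking
 * whether bit 22 of the running byte counter flipped.  The real kernel
 * check is structured differently. */
static int ecn_bytes_edge(uint32_t old_bytes, uint32_t new_bytes)
{
	return ((old_bytes ^ new_bytes) & (1U << 22)) != 0;
}
```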
Chia-Yu Chang (Nokia) May 6, 2025, 1:56 p.m. UTC | #9
> -----Original Message-----
> From: Chia-Yu Chang (Nokia) 
> Sent: Monday, May 5, 2025 5:25 PM
> To: Paolo Abeni <pabeni@redhat.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com>
> Cc: Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com>
> Subject: RE: [PATCH v5 net-next 03/15] tcp: AccECN core
> 
> > -----Original Message-----
> > From: Paolo Abeni <pabeni@redhat.com>
> > Sent: Tuesday, April 29, 2025 12:14 PM
> > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; 
> > horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; 
> > bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; 
> > jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; 
> > xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; 
> > edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; 
> > ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; 
> > linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; 
> > Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; 
> > g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; 
> > mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; 
> > Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com>
> > Cc: Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com>
> > Subject: Re: [PATCH v5 net-next 03/15] tcp: AccECN core
> > 
> > 
> > 
> > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote:
> > > @@ -298,6 +298,9 @@ struct tcp_sock {
> > >       u32     snd_up;         /* Urgent pointer               */
> > >       u32     delivered;      /* Total data packets delivered incl. rexmits */
> > >       u32     delivered_ce;   /* Like the above but only ECE marked packets */
> > > +     u32     received_ce;    /* Like the above but for rcvd CE marked pkts */
> > > +     u8      received_ce_pending:4, /* Not yet transmit cnt of received_ce */
> > > +             unused2:4;
> > 
> > AFAICS this uses a 4-byte hole present prior to this patch after "rcv_wnd", leaving a 3-byte hole after 'unused2'. Possibly worth mentioning the hole's presence.
> > 
> > @Eric: would it make sense to use this hole for 'nonagle'/'rate_app_limited' and shrink the 'tcp_sock_write_txrx' group a bit?

Hi,

By moving nonagle/rate_app_limited to the beginning of this group, I managed to reuse the 3-byte hole at the start of __cacheline_group_begin__tcp_sock_write_txrx.
I will include this in the next version; you can find the pahole results below:


/*BEFORE this patch*/
__u8                       __cacheline_group_end__tcp_sock_write_tx[0]; /*  2585     0 */
__u8                       __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2585     0 */

/* XXX 3 bytes hole, try to pack */

__be32                     pred_flags;           /*  2588     4 */
u64                        tcp_clock_cache;      /*  2592     8 */
u64                        tcp_mstamp;           /*  2600     8 */
u32                        rcv_nxt;              /*  2608     4 */
u32                        snd_nxt;              /*  2612     4 */
u32                        snd_una;              /*  2616     4 */
u32                        window_clamp;         /*  2620     4 */
/* --- cacheline 41 boundary (2624 bytes) --- */
u32                        srtt_us;              /*  2624     4 */
u32                        packets_out;          /*  2628     4 */
u32                        snd_up;               /*  2632     4 */
u32                        delivered;            /*  2636     4 */
u32                        delivered_ce;         /*  2640     4 */
u32                        app_limited;          /*  2644     4 */
u32                        rcv_wnd;              /*  2648     4 */
struct tcp_options_received rx_opt;              /*  2652    24 */
u8                         nonagle:4;            /*  2676: 0  1 */
u8                         rate_app_limited:1;   /*  2676: 4  1 */

/* XXX 3 bits hole, try to pack */

__u8                       __cacheline_group_end__tcp_sock_write_txrx[0]; /*  2677     0 */

/* XXX 3 bytes hole, try to pack */

__u8                       __cacheline_group_begin__tcp_sock_write_rx[0] __attribute__((__aligned__(8))); /*  2680     0 */


/*AFTER this patch*/
__u8                       __cacheline_group_end__tcp_sock_write_tx[0]; /*  2585     0 */
__u8                       __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2585     0 */
u8                         nonagle:4;            /*  2585: 0  1 */
u8                         rate_app_limited:1;   /*  2585: 4  1 */

/* XXX 3 bits hole, try to pack */

/* Force alignment to the next boundary: */
u8                         :0;

u8                         received_ce_pending:4; /*  2586: 0  1 */
u8                         unused2:4;            /*  2586: 4  1 */

/* XXX 1 byte hole, try to pack */

__be32                     pred_flags;           /*  2588     4 */
u64                        tcp_clock_cache;      /*  2592     8 */
u64                        tcp_mstamp;           /*  2600     8 */
u32                        rcv_nxt;              /*  2608     4 */
u32                        snd_nxt;              /*  2612     4 */
u32                        snd_una;              /*  2616     4 */
u32                        window_clamp;         /*  2620     4 */
/* --- cacheline 41 boundary (2624 bytes) --- */
u32                        srtt_us;              /*  2624     4 */
u32                        packets_out;          /*  2628     4 */
u32                        snd_up;               /*  2632     4 */
u32                        delivered;            /*  2636     4 */
u32                        delivered_ce;         /*  2640     4 */
u32                        received_ce;          /*  2644     4 */
u32                        app_limited;          /*  2648     4 */
u32                        rcv_wnd;              /*  2652     4 */
struct tcp_options_received rx_opt;              /*  2656    24 */
__u8                       __cacheline_group_end__tcp_sock_write_txrx[0]; /*  2680     0 */
__u8                       __cacheline_group_begin__tcp_sock_write_rx[0] __attribute__((__aligned__(8))); /*  2680     0 */
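
A side note on the `u8 :0;` line in the layout above: a zero-width bitfield forces the next bitfield to start at a fresh storage-unit boundary. A toy struct (my own example, not the real tcp_sock) demonstrates the effect:

```c
#include <stdint.h>

/* Toy struct (not the real tcp_sock): the zero-width bitfield closes
 * the current byte so received_ce_pending starts in a new byte, giving
 * the same two-byte layout the pahole output above shows. */
struct toy_flags {
	uint8_t nonagle:4;
	uint8_t rate_app_limited:1;
	uint8_t :0;                  /* force alignment to the next byte */
	uint8_t received_ce_pending:4;
	uint8_t unused2:4;
};
```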

> 
> Hi Paolo,
> 
> 	Thanks for the feedback, and sorry for my late response.
> 	I can either mention it here or move the fields.
> 	However, as the following patches will continue to change the holes, mentioning the hole change per patch may make it more understandable.
> 	If this is fine for you, I will make the revision in the next version.
> 
> > 
> > [...]
> > > @@ -5095,7 +5097,7 @@ static void __init tcp_struct_check(void)
> > >       /* 32bit arches with 8byte alignment on u64 fields might need padding
> > >        * before tcp_clock_cache.
> > >        */
> > > -     CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 92 + 4);
> > > +     CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 97 + 7);
> > 
> > Really? I *think* the change here should not move the cacheline end 
> > around, due to holes. Could you please include the relevant pahole
> > (trimmed) output prior to this patch and after in the commit message?
> > 
> 
> Here is the pahole output before and after this patch.
> Indeed, it creates a 3-byte hole after 'unused2', so it adds (5 + 3) = 8 bytes to the original 92 + 4.
> Finally, it becomes 92 + 4 + (5 + 3) = 97 + 7.
> 	
> *BEFORE this patch*
>     __u8                       __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2585     0 */
> 
>     /* XXX 3 bytes hole, try to pack */
> 
>     __be32                     pred_flags;           /*  2588     4 */
>     u64                        tcp_clock_cache;      /*  2592     8 */
>     u64                        tcp_mstamp;           /*  2600     8 */
>     u32                        rcv_nxt;              /*  2608     4 */
>     u32                        snd_nxt;              /*  2612     4 */
>     u32                        snd_una;              /*  2616     4 */
>     u32                        window_clamp;         /*  2620     4 */
>     /* --- cacheline 41 boundary (2624 bytes) --- */
>     u32                        srtt_us;              /*  2624     4 */
>     u32                        packets_out;          /*  2628     4 */
>     u32                        snd_up;               /*  2632     4 */
>     u32                        delivered;            /*  2636     4 */
>     u32                        delivered_ce;         /*  2640     4 */
>     u32                        app_limited;          /*  2644     4 */
>     u32                        rcv_wnd;              /*  2648     4 */
>     struct tcp_options_received rx_opt;              /*  2652    24 */
>     u8                         nonagle:4;            /*  2676: 0  1 */
>     u8                         rate_app_limited:1;   /*  2676: 4  1 */
> 
>     /* XXX 3 bits hole, try to pack */
> 
>     __u8                       __cacheline_group_end__tcp_sock_write_txrx[0]; /*  2677     0 */
> 
>     /* XXX 3 bytes hole, try to pack */
> 
> *AFTER this patch*
>     __u8                       __cacheline_group_begin__tcp_sock_write_txrx[0]; /*  2585     0 */
> 
>     /* XXX 3 bytes hole, try to pack */              
>     
>     __be32                     pred_flags;           /*  2588     4 */
>     u64                        tcp_clock_cache;      /*  2592     8 */
>     u64                        tcp_mstamp;           /*  2600     8 */
>     u32                        rcv_nxt;              /*  2608     4 */
>     u32                        snd_nxt;              /*  2612     4 */
>     u32                        snd_una;              /*  2616     4 */
>     u32                        window_clamp;         /*  2620     4 */
>     /* --- cacheline 41 boundary (2624 bytes) --- */
>     u32                        srtt_us;              /*  2624     4 */
>     u32                        packets_out;          /*  2628     4 */
>     u32                        snd_up;               /*  2632     4 */
>     u32                        delivered;            /*  2636     4 */
>     u32                        delivered_ce;         /*  2640     4 */ 
>     u32                        received_ce;          /*  2644     4 */
>     u8                         received_ce_pending:4; /*  2648: 0  1 */
>     u8                         unused2:4;            /*  2648: 4  1 */
> 
>     /* XXX 3 bytes hole, try to pack */
> 
>     u32                        app_limited;          /*  2652     4 */
>     u32                        rcv_wnd;              /*  2656     4 */
>     struct tcp_options_received rx_opt;              /*  2660    24 */
>     u8                         nonagle:4;            /*  2684: 0  1 */
>     u8                         rate_app_limited:1;   /*  2684: 4  1 */
> 
>     /* XXX 3 bits hole, try to pack */
> 
>     __u8                       __cacheline_group_end__tcp_sock_write_txrx[0]; /*  2685     0 */
> 
> 
> > [...]
> > > @@ -384,17 +387,16 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb)
> > >               if (tcp_ca_needs_ecn(sk))
> > >                       tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
> > >
> > > -             if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
> > > +             if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR) &&
> > > +                 tcp_ecn_mode_rfc3168(tp)) {
> > >                       /* Better not delay acks, sender can have a very low cwnd */
> > >                       tcp_enter_quickack_mode(sk, 2);
> > >                       tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
> > >               }
> > > -             tp->ecn_flags |= TCP_ECN_SEEN;
> > 
> > At this point is not entirely clear to me why the removal of the above line is needed/correct.
> 
> What I see is that setting this flag moves from here to tcp_ecn_received_counters().
> 
> Also, this function only runs when data is received, being called from tcp_event_data_recv().
> 
> Meanwhile, tcp_ecn_received_counters() takes effect in more places (i.e., whether len <= tcp_header_len or not) to ensure the ACE counter tracks all segments, including pure ACKs.
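
For context on why every segment must feed the counters: the ACE field is only 3 bits wide, so the receiver's CE packet count is echoed modulo 8 and the sender derives a wrapped per-ACK delta. A minimal sketch (my simplification based on the AccECN draft, not the kernel code):

```c
#include <stdint.h>

/* Hedged sketch: the ACE field echoes the low 3 bits of the receiver's
 * CE packet counter, so the sender computes the increment as a
 * modulo-8 difference.  Lost ACKs can make this delta ambiguous, which
 * is why the byte counters in the AccECN option aid validation. */
static uint32_t ace_delta(uint8_t ace_now, uint8_t ace_prev)
{
	return (uint32_t)((ace_now - ace_prev) & 7);
}
```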
> 
> > [...]
> > > @@ -4056,6 +4118,11 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
> > >
> > >       tcp_rack_update_reo_wnd(sk, &rs);
> > >
> > > +     if (tcp_ecn_mode_accecn(tp))
> > > +             ecn_count = tcp_accecn_process(sk, skb,
> > > +                                            tp->delivered - delivered,
> > > +                                            &flag);
> > 
> > AFAICS the above could set FLAG_ECE in flags, meaning the previous
> > tcp_clean_rtx_queue() will run with such flag cleared and the later functions checking such flag will not. I'm wondering if this inconsistency could cause problems?
> 
> The flag set by tcp_accecn_process() will be used by the following functions: tcp_in_ack_event() and tcp_fastretrans_alert().
> 
> And this only impacts AccECN mode.
> 
> Best regards,
> Chia-Yu
> 
> > 
> > /P