Message ID | 20250422153602.54787-10-chia-yu.chang@nokia-bell-labs.com |
---|---|
State | New |
Headers | show |
Series | AccECN protocol patch series | expand |
On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > @@ -302,10 +303,13 @@ struct tcp_sock { > u32 snd_up; /* Urgent pointer */ > u32 delivered; /* Total data packets delivered incl. rexmits */ > u32 delivered_ce; /* Like the above but only ECE marked packets */ > + u32 delivered_ecn_bytes[3]; This new fields do not belong to this cacheline group. I'm unsure they belong to fast-path at all. Also u32 will wrap-around very soon. [...] > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > index dc8fdc80e16b..74ac8a5d2e00 100644 > --- a/include/uapi/linux/tcp.h > +++ b/include/uapi/linux/tcp.h > @@ -298,6 +298,13 @@ struct tcp_info { > __u32 tcpi_snd_wnd; /* peer's advertised receive window after > * scaling (bytes) > */ > + __u32 tcpi_received_ce; /* # of CE marks received */ > + __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */ > + __u32 tcpi_delivered_e0_bytes; > + __u32 tcpi_delivered_ce_bytes; > + __u32 tcpi_received_e1_bytes; > + __u32 tcpi_received_e0_bytes; > + __u32 tcpi_received_ce_bytes; This will break uAPI: new fields must be addded at the end, or must fill existing holes. Also u32 set in stone in uAPI for a byte counter looks way too small. > @@ -5100,7 +5113,7 @@ static void __init tcp_struct_check(void) > /* 32bit arches with 8byte alignment on u64 fields might need padding > * before tcp_clock_cache. > */ > - CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 109 + 7); > + CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 122 + 6); The above means an additional cacheline in fast-path WRT the current status. IMHO should be avoided. 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 5bd7fc9bcf66..41e45b9aff3f 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -70,6 +70,7 @@ > #include <linux/sysctl.h> > #include <linux/kernel.h> > #include <linux/prefetch.h> > +#include <linux/bitops.h> > #include <net/dst.h> > #include <net/tcp.h> > #include <net/proto_memory.h> > @@ -499,6 +500,144 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr > return false; > } > > +/* Maps IP ECN field ECT/CE code point to AccECN option field number, given > + * we are sending fields with Accurate ECN Order 1: ECT(1), CE, ECT(0). > + */ > +static u8 tcp_ecnfield_to_accecn_optfield(u8 ecnfield) > +{ > + switch (ecnfield) { > + case INET_ECN_NOT_ECT: > + return 0; /* AccECN does not send counts of NOT_ECT */ > + case INET_ECN_ECT_1: > + return 1; > + case INET_ECN_CE: > + return 2; > + case INET_ECN_ECT_0: > + return 3; > + default: > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); No WARN_ONCE() above please: either the 'ecnfield' data is masked vs INET_ECN_MASK and the WARN_ONCE should not be possible or a remote sender can deterministically trigger a WARN() which nowadays will in turn raise a CVE... [...] > +static u32 tcp_accecn_field_init_offset(u8 ecnfield) > +{ > + switch (ecnfield) { > + case INET_ECN_NOT_ECT: > + return 0; /* AccECN does not send counts of NOT_ECT */ > + case INET_ECN_ECT_1: > + return TCP_ACCECN_E1B_INIT_OFFSET; > + case INET_ECN_CE: > + return TCP_ACCECN_CEB_INIT_OFFSET; > + case INET_ECN_ECT_0: > + return TCP_ACCECN_E0B_INIT_OFFSET; > + default: > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); Same as above. > + } > + return 0; > +} > + > +/* Maps AccECN option field #nr to IP ECN field ECT/CE bits */ > +static unsigned int tcp_accecn_optfield_to_ecnfield(unsigned int optfield, > + bool order) > +{ > + u8 tmp; > + > + optfield = order ? 
2 - optfield : optfield; > + tmp = optfield + 2; > + > + return (tmp + (tmp >> 2)) & INET_ECN_MASK; > +} > + > +/* Handles AccECN option ECT and CE 24-bit byte counters update into > + * the u32 value in tcp_sock. As we're processing TCP options, it is > + * safe to access from - 1. > + */ > +static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 init_offset) > +{ > + u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & > + 0xFFFFFFU; > + u32 delta = (truncated - *cnt) & 0xFFFFFFU; > + > + /* If delta has the highest bit set (24th bit) indicating > + * negative, sign extend to correct an estimation using > + * sign_extend32(delta, 24 - 1) > + */ > + delta = sign_extend32(delta, 23); > + *cnt += delta; > + return (s32)delta; > +} > + > +/* Returns true if the byte counters can be used */ > +static bool tcp_accecn_process_option(struct tcp_sock *tp, > + const struct sk_buff *skb, > + u32 delivered_bytes, int flag) > +{ > + u8 estimate_ecnfield = tp->est_ecnfield; > + bool ambiguous_ecn_bytes_incr = false; > + bool first_changed = false; > + unsigned int optlen; > + unsigned char *ptr; > + bool order1, res; > + unsigned int i; > + > + if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn) { > + if (estimate_ecnfield) { > + u8 ecnfield = estimate_ecnfield - 1; > + > + tp->delivered_ecn_bytes[ecnfield] += delivered_bytes; > + return true; > + } > + return false; > + } > + > + ptr = skb_transport_header(skb) + tp->rx_opt.accecn; > + optlen = ptr[1] - 2; This assumes optlen is greater then 2, but I don't see the relevant check. Are tcp options present at all? > + WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1); Please, don't warn for arbitrary wrong data sent from the peer. 
> + order1 = (ptr[0] == TCPOPT_ACCECN1); > + ptr += 2; > + > + res = !!estimate_ecnfield; > + for (i = 0; i < 3; i++) { > + if (optlen >= TCPOLEN_ACCECN_PERFIELD) { > + u32 init_offset; > + u8 ecnfield; > + s32 delta; > + u32 *cnt; > + > + ecnfield = tcp_accecn_optfield_to_ecnfield(i, order1); > + init_offset = tcp_accecn_field_init_offset(ecnfield); > + cnt = &tp->delivered_ecn_bytes[ecnfield - 1]; > + delta = tcp_update_ecn_bytes(cnt, ptr, init_offset); > + if (delta) { > + if (delta < 0) { > + res = false; > + ambiguous_ecn_bytes_incr = true; > + } > + if (ecnfield != estimate_ecnfield) { > + if (!first_changed) { > + tp->est_ecnfield = ecnfield; > + first_changed = true; > + } else { > + res = false; > + ambiguous_ecn_bytes_incr = true; > + } At least 2 indentation levels above the maximum readable. [...] > @@ -4378,6 +4524,7 @@ void tcp_parse_options(const struct net *net, > > ptr = (const unsigned char *)(th + 1); > opt_rx->saw_tstamp = 0; > + opt_rx->accecn = 0; > opt_rx->saw_unknown = 0; It would be good to be able to zero both 'accecn' and 'saw_unknown' with a single statement. [...] 
> @@ -766,6 +769,47 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > *ptr++ = htonl(opts->tsecr); > } > > + if (OPTION_ACCECN & options) { > + const u8 ect0_idx = INET_ECN_ECT_0 - 1; > + const u8 ect1_idx = INET_ECN_ECT_1 - 1; > + const u8 ce_idx = INET_ECN_CE - 1; > + u32 e0b; > + u32 e1b; > + u32 ceb; > + u8 len; > + > + e0b = opts->ecn_bytes[ect0_idx] + TCP_ACCECN_E0B_INIT_OFFSET; > + e1b = opts->ecn_bytes[ect1_idx] + TCP_ACCECN_E1B_INIT_OFFSET; > + ceb = opts->ecn_bytes[ce_idx] + TCP_ACCECN_CEB_INIT_OFFSET; > + len = TCPOLEN_ACCECN_BASE + > + opts->num_accecn_fields * TCPOLEN_ACCECN_PERFIELD; > + > + if (opts->num_accecn_fields == 2) { > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > + ((e1b >> 8) & 0xffff)); > + *ptr++ = htonl(((e1b & 0xff) << 24) | > + (ceb & 0xffffff)); > + } else if (opts->num_accecn_fields == 1) { > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > + ((e1b >> 8) & 0xffff)); > + leftover_bytes = ((e1b & 0xff) << 8) | > + TCPOPT_NOP; > + leftover_size = 1; > + } else if (opts->num_accecn_fields == 0) { > + leftover_bytes = (TCPOPT_ACCECN1 << 8) | len; > + leftover_size = 2; > + } else if (opts->num_accecn_fields == 3) { > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > + ((e1b >> 8) & 0xffff)); > + *ptr++ = htonl(((e1b & 0xff) << 24) | > + (ceb & 0xffffff)); > + *ptr++ = htonl(((e0b & 0xffffff) << 8) | > + TCPOPT_NOP); The above chunck and the contents of patch 7 must be in the same patch. This split makes the review even harder. [...] > @@ -1117,6 +1235,17 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb > opts->num_sack_blocks = 0; > } > > + if (tcp_ecn_mode_accecn(tp) && > + sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { > + int saving = opts->num_sack_blocks > 0 ? 2 : 0; > + int remaining = MAX_TCP_OPTION_SPACE - size; AFACS the above means tcp_options_fit_accecn() must clear any already set options, but apparently it does not do so. 
Have you tested with something adding largish options like mptcp? /P
> -----Original Message----- > From: Paolo Abeni <pabeni@redhat.com> > Sent: Tuesday, April 29, 2025 1:56 PM > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > Subject: Re: [PATCH v5 net-next 09/15] tcp: accecn: AccECN option > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > @@ -302,10 +303,13 @@ struct tcp_sock { > > u32 snd_up; /* Urgent pointer */ > > u32 delivered; /* Total data packets delivered incl. rexmits */ > > u32 delivered_ce; /* Like the above but only ECE marked packets */ > > + u32 delivered_ecn_bytes[3]; > > This new fields do not belong to this cacheline group. I'm unsure they belong to fast-path at all. Also u32 will wrap-around very soon. Hi Paolo, Thanks for the feedback. Could you help to advise then which cacheline group it belongs to? If there are some tools that can be shared, it will be appreciated. > > [...] 
> > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index > > dc8fdc80e16b..74ac8a5d2e00 100644 > > --- a/include/uapi/linux/tcp.h > > +++ b/include/uapi/linux/tcp.h > > @@ -298,6 +298,13 @@ struct tcp_info { > > __u32 tcpi_snd_wnd; /* peer's advertised receive window after > > * scaling (bytes) > > */ > > + __u32 tcpi_received_ce; /* # of CE marks received */ > > + __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */ > > + __u32 tcpi_delivered_e0_bytes; > > + __u32 tcpi_delivered_ce_bytes; > > + __u32 tcpi_received_e1_bytes; > > + __u32 tcpi_received_e0_bytes; > > + __u32 tcpi_received_ce_bytes; > > This will break uAPI: new fields must be addded at the end, or must fill existing holes. Also u32 set in stone in uAPI for a byte counter looks way too small. I will move at the end or fill existing holes using pahole. Indeed u32 is not big, but based on the algorithms in A.2.1 and A.1. of AccECN draft, the byte counter greater than 24b shall be fine. And this is also verified using TCP Prague. > > > @@ -5100,7 +5113,7 @@ static void __init tcp_struct_check(void) > > /* 32bit arches with 8byte alignment on u64 fields might need padding > > * before tcp_clock_cache. > > */ > > - CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 109 + 7); > > + CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, > > + tcp_sock_write_txrx, 122 + 6); > > The above means an additional cacheline in fast-path WRT the current status. IMHO should be avoided. OK, I did this to avoid the line width warning of patchcheck, but will change it back. 
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index > > 5bd7fc9bcf66..41e45b9aff3f 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -70,6 +70,7 @@ > > #include <linux/sysctl.h> > > #include <linux/kernel.h> > > #include <linux/prefetch.h> > > +#include <linux/bitops.h> > > #include <net/dst.h> > > #include <net/tcp.h> > > #include <net/proto_memory.h> > > @@ -499,6 +500,144 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr > > return false; > > } > > > > +/* Maps IP ECN field ECT/CE code point to AccECN option field number, > > +given > > + * we are sending fields with Accurate ECN Order 1: ECT(1), CE, ECT(0). > > + */ > > +static u8 tcp_ecnfield_to_accecn_optfield(u8 ecnfield) { > > + switch (ecnfield) { > > + case INET_ECN_NOT_ECT: > > + return 0; /* AccECN does not send counts of NOT_ECT */ > > + case INET_ECN_ECT_1: > > + return 1; > > + case INET_ECN_CE: > > + return 2; > > + case INET_ECN_ECT_0: > > + return 3; > > + default: > > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); > > No WARN_ONCE() above please: either the 'ecnfield' data is masked vs INET_ECN_MASK and the WARN_ONCE should not be possible or a remote sender can deterministically trigger a WARN() which nowadays will in turn raise a CVE... Sure, I will add the mask here. > > [...] > > +static u32 tcp_accecn_field_init_offset(u8 ecnfield) { > > + switch (ecnfield) { > > + case INET_ECN_NOT_ECT: > > + return 0; /* AccECN does not send counts of NOT_ECT */ > > + case INET_ECN_ECT_1: > > + return TCP_ACCECN_E1B_INIT_OFFSET; > > + case INET_ECN_CE: > > + return TCP_ACCECN_CEB_INIT_OFFSET; > > + case INET_ECN_ECT_0: > > + return TCP_ACCECN_E0B_INIT_OFFSET; > > + default: > > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); > > Same as above. 
> > > + } > > + return 0; > > +} > > + > > +/* Maps AccECN option field #nr to IP ECN field ECT/CE bits */ static > > +unsigned int tcp_accecn_optfield_to_ecnfield(unsigned int optfield, > > + bool order) { > > + u8 tmp; > > + > > + optfield = order ? 2 - optfield : optfield; > > + tmp = optfield + 2; > > + > > + return (tmp + (tmp >> 2)) & INET_ECN_MASK; } > > + > > +/* Handles AccECN option ECT and CE 24-bit byte counters update into > > + * the u32 value in tcp_sock. As we're processing TCP options, it is > > + * safe to access from - 1. > > + */ > > +static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 > > +init_offset) { > > + u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & > > + 0xFFFFFFU; > > + u32 delta = (truncated - *cnt) & 0xFFFFFFU; > > + > > + /* If delta has the highest bit set (24th bit) indicating > > + * negative, sign extend to correct an estimation using > > + * sign_extend32(delta, 24 - 1) > > + */ > > + delta = sign_extend32(delta, 23); > > + *cnt += delta; > > + return (s32)delta; > > +} > > + > > +/* Returns true if the byte counters can be used */ static bool > > +tcp_accecn_process_option(struct tcp_sock *tp, > > + const struct sk_buff *skb, > > + u32 delivered_bytes, int flag) { > > + u8 estimate_ecnfield = tp->est_ecnfield; > > + bool ambiguous_ecn_bytes_incr = false; > > + bool first_changed = false; > > + unsigned int optlen; > > + unsigned char *ptr; > > + bool order1, res; > > + unsigned int i; > > + > > + if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn) { > > + if (estimate_ecnfield) { > > + u8 ecnfield = estimate_ecnfield - 1; > > + > > + tp->delivered_ecn_bytes[ecnfield] += delivered_bytes; > > + return true; > > + } > > + return false; > > + } > > + > > + ptr = skb_transport_header(skb) + tp->rx_opt.accecn; > > + optlen = ptr[1] - 2; > > This assumes optlen is greater then 2, but I don't see the relevant check. Are tcp options present at all? 
This function is executed only when AccECN mode is negotiated. And the above condition "if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn)" covers the case in which AccECN option is not present. So, I would think this is safe; please let me know if you think otherwise. > > > + WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != > > + TCPOPT_ACCECN1); > > Please, don't warn for arbitrary wrong data sent from the peer. Sure, will remove. > > > + order1 = (ptr[0] == TCPOPT_ACCECN1); > > + ptr += 2; > > + > > + res = !!estimate_ecnfield; > > + for (i = 0; i < 3; i++) { > > + if (optlen >= TCPOLEN_ACCECN_PERFIELD) { > > + u32 init_offset; > > + u8 ecnfield; > > + s32 delta; > > + u32 *cnt; > > + > > + ecnfield = tcp_accecn_optfield_to_ecnfield(i, order1); > > + init_offset = tcp_accecn_field_init_offset(ecnfield); > > + cnt = &tp->delivered_ecn_bytes[ecnfield - 1]; > > + delta = tcp_update_ecn_bytes(cnt, ptr, init_offset); > > + if (delta) { > > + if (delta < 0) { > > + res = false; > > + ambiguous_ecn_bytes_incr = true; > > + } > > + if (ecnfield != estimate_ecnfield) { > > + if (!first_changed) { > > + tp->est_ecnfield = ecnfield; > > + first_changed = true; > > + } else { > > + res = false; > > + ambiguous_ecn_bytes_incr = true; > > + } > > At least 2 indentation levels above the maximum readable. OK, let me think how to simplify it in next version. > > [...] > > @@ -4378,6 +4524,7 @@ void tcp_parse_options(const struct net *net, > > > > ptr = (const unsigned char *)(th + 1); > > opt_rx->saw_tstamp = 0; > > + opt_rx->accecn = 0; > > opt_rx->saw_unknown = 0; > > It would be good to be able to zero both 'accecn' and 'saw_unknown' with a single statement. ok, will do. > > [...] 
> > @@ -766,6 +769,47 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > *ptr++ = htonl(opts->tsecr); > > } > > > > + if (OPTION_ACCECN & options) { > > + const u8 ect0_idx = INET_ECN_ECT_0 - 1; > > + const u8 ect1_idx = INET_ECN_ECT_1 - 1; > > + const u8 ce_idx = INET_ECN_CE - 1; > > + u32 e0b; > > + u32 e1b; > > + u32 ceb; > > + u8 len; > > + > > + e0b = opts->ecn_bytes[ect0_idx] + TCP_ACCECN_E0B_INIT_OFFSET; > > + e1b = opts->ecn_bytes[ect1_idx] + TCP_ACCECN_E1B_INIT_OFFSET; > > + ceb = opts->ecn_bytes[ce_idx] + TCP_ACCECN_CEB_INIT_OFFSET; > > + len = TCPOLEN_ACCECN_BASE + > > + opts->num_accecn_fields * TCPOLEN_ACCECN_PERFIELD; > > + > > + if (opts->num_accecn_fields == 2) { > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > + ((e1b >> 8) & 0xffff)); > > + *ptr++ = htonl(((e1b & 0xff) << 24) | > > + (ceb & 0xffffff)); > > + } else if (opts->num_accecn_fields == 1) { > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > + ((e1b >> 8) & 0xffff)); > > + leftover_bytes = ((e1b & 0xff) << 8) | > > + TCPOPT_NOP; > > + leftover_size = 1; > > + } else if (opts->num_accecn_fields == 0) { > > + leftover_bytes = (TCPOPT_ACCECN1 << 8) | len; > > + leftover_size = 2; > > + } else if (opts->num_accecn_fields == 3) { > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > + ((e1b >> 8) & 0xffff)); > > + *ptr++ = htonl(((e1b & 0xff) << 24) | > > + (ceb & 0xffffff)); > > + *ptr++ = htonl(((e0b & 0xffffff) << 8) | > > + TCPOPT_NOP); > > The above chunck and the contents of patch 7 must be in the same patch. > This split makes the review even harder. Thanks for feedback, I will merge these 2 patches. > > [...] > > @@ -1117,6 +1235,17 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb > > opts->num_sack_blocks = 0; > > } > > > > + if (tcp_ecn_mode_accecn(tp) && > > + sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { > > + int saving = opts->num_sack_blocks > 0 ? 
2 : 0; > > + int remaining = MAX_TCP_OPTION_SPACE - size; > > AFACS the above means tcp_options_fit_accecn() must clear any already set options, but apparently it does not do so. Have you tested with something adding largish options like mptcp? I see this part is NOT to clear already set option, but to calculate how long AccECN option will fit to remaining option space. > > /P
On Tue, 29 Apr 2025, Paolo Abeni wrote: > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > @@ -302,10 +303,13 @@ struct tcp_sock { > > u32 snd_up; /* Urgent pointer */ > > u32 delivered; /* Total data packets delivered incl. rexmits */ > > u32 delivered_ce; /* Like the above but only ECE marked packets */ > > + u32 delivered_ecn_bytes[3]; > > This new fields do not belong to this cacheline group. I'm unsure they > belong to fast-path at all. Also u32 will wrap-around very soon. > > [...] > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > index dc8fdc80e16b..74ac8a5d2e00 100644 > > --- a/include/uapi/linux/tcp.h > > +++ b/include/uapi/linux/tcp.h > > @@ -298,6 +298,13 @@ struct tcp_info { > > __u32 tcpi_snd_wnd; /* peer's advertised receive window after > > * scaling (bytes) > > */ > > + __u32 tcpi_received_ce; /* # of CE marks received */ > > + __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */ > > + __u32 tcpi_delivered_e0_bytes; > > + __u32 tcpi_delivered_ce_bytes; > > + __u32 tcpi_received_e1_bytes; > > + __u32 tcpi_received_e0_bytes; > > + __u32 tcpi_received_ce_bytes; > > This will break uAPI: new fields must be addded at the end, or must fill > existing holes. Also u32 set in stone in uAPI for a byte counter looks > way too small. > > > @@ -5100,7 +5113,7 @@ static void __init tcp_struct_check(void) > > /* 32bit arches with 8byte alignment on u64 fields might need padding > > * before tcp_clock_cache. > > */ > > - CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 109 + 7); > > + CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 122 + 6); > > The above means an additional cacheline in fast-path WRT the current > status. IMHO should be avoided. 
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > index 5bd7fc9bcf66..41e45b9aff3f 100644 > > --- a/net/ipv4/tcp_input.c > > +++ b/net/ipv4/tcp_input.c > > @@ -70,6 +70,7 @@ > > #include <linux/sysctl.h> > > #include <linux/kernel.h> > > #include <linux/prefetch.h> > > +#include <linux/bitops.h> > > #include <net/dst.h> > > #include <net/tcp.h> > > #include <net/proto_memory.h> > > @@ -499,6 +500,144 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr > > return false; > > } > > > > +/* Maps IP ECN field ECT/CE code point to AccECN option field number, given > > + * we are sending fields with Accurate ECN Order 1: ECT(1), CE, ECT(0). > > + */ > > +static u8 tcp_ecnfield_to_accecn_optfield(u8 ecnfield) > > +{ > > + switch (ecnfield) { > > + case INET_ECN_NOT_ECT: > > + return 0; /* AccECN does not send counts of NOT_ECT */ > > + case INET_ECN_ECT_1: > > + return 1; > > + case INET_ECN_CE: > > + return 2; > > + case INET_ECN_ECT_0: > > + return 3; > > + default: > > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); > > No WARN_ONCE() above please: either the 'ecnfield' data is masked vs > INET_ECN_MASK and the WARN_ONCE should not be possible or a remote > sender can deterministically trigger a WARN() which nowadays will in > turn raise a CVE... > > [...] > > +static u32 tcp_accecn_field_init_offset(u8 ecnfield) > > +{ > > + switch (ecnfield) { > > + case INET_ECN_NOT_ECT: > > + return 0; /* AccECN does not send counts of NOT_ECT */ > > + case INET_ECN_ECT_1: > > + return TCP_ACCECN_E1B_INIT_OFFSET; > > + case INET_ECN_CE: > > + return TCP_ACCECN_CEB_INIT_OFFSET; > > + case INET_ECN_ECT_0: > > + return TCP_ACCECN_E0B_INIT_OFFSET; > > + default: > > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); > > Same as above. 
> > > + } > > + return 0; > > +} > > + > > +/* Maps AccECN option field #nr to IP ECN field ECT/CE bits */ > > +static unsigned int tcp_accecn_optfield_to_ecnfield(unsigned int optfield, > > + bool order) > > +{ > > + u8 tmp; > > + > > + optfield = order ? 2 - optfield : optfield; > > + tmp = optfield + 2; > > + > > + return (tmp + (tmp >> 2)) & INET_ECN_MASK; > > +} > > + > > +/* Handles AccECN option ECT and CE 24-bit byte counters update into > > + * the u32 value in tcp_sock. As we're processing TCP options, it is > > + * safe to access from - 1. > > + */ > > +static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 init_offset) > > +{ > > + u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & > > + 0xFFFFFFU; > > + u32 delta = (truncated - *cnt) & 0xFFFFFFU; > > + > > + /* If delta has the highest bit set (24th bit) indicating > > + * negative, sign extend to correct an estimation using > > + * sign_extend32(delta, 24 - 1) > > + */ > > + delta = sign_extend32(delta, 23); > > + *cnt += delta; > > + return (s32)delta; > > +} > > + > > +/* Returns true if the byte counters can be used */ > > +static bool tcp_accecn_process_option(struct tcp_sock *tp, > > + const struct sk_buff *skb, > > + u32 delivered_bytes, int flag) > > +{ > > + u8 estimate_ecnfield = tp->est_ecnfield; > > + bool ambiguous_ecn_bytes_incr = false; > > + bool first_changed = false; > > + unsigned int optlen; > > + unsigned char *ptr; u8 would we more appropriate type for binary data. > > + bool order1, res; > > + unsigned int i; > > + > > + if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn) { > > + if (estimate_ecnfield) { > > + u8 ecnfield = estimate_ecnfield - 1; > > + > > + tp->delivered_ecn_bytes[ecnfield] += delivered_bytes; > > + return true; > > + } > > + return false; > > + } > > + > > + ptr = skb_transport_header(skb) + tp->rx_opt.accecn; > > + optlen = ptr[1] - 2; > > This assumes optlen is greater then 2, but I don't see the relevant > check. 
The options parser should check that, please see the "silly options" check. > Are tcp options present at all? There is !tp->rx_opt.accecn check above which should ensure we're processing only AccECN Option that is present. > > + WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1); > > Please, don't warn for arbitrary wrong data sent from the peer. If there isn't AccECN option at ptr, there's a bug elsewhere in the code (in the option parse code). So this is an internal sanity check that tp->rx_opt.accecn points to AccECN option for real like it should. If you still want that removed, no problem but it should not be arbitrary data at this point because the options parsing code should have validated this condition already, thus WARN_ON_ONCE() seemed appropriate to me. > > + order1 = (ptr[0] == TCPOPT_ACCECN1); > > + ptr += 2; > > + > > + res = !!estimate_ecnfield; > > + for (i = 0; i < 3; i++) { > > + if (optlen >= TCPOLEN_ACCECN_PERFIELD) { It's easy to reverse logic here and use continue, which buys one level of indentation. > > + u32 init_offset; > > + u8 ecnfield; > > + s32 delta; > > + u32 *cnt; > > + > > + ecnfield = tcp_accecn_optfield_to_ecnfield(i, order1); > > + init_offset = tcp_accecn_field_init_offset(ecnfield); > > + cnt = &tp->delivered_ecn_bytes[ecnfield - 1]; > > + delta = tcp_update_ecn_bytes(cnt, ptr, init_offset); > > + if (delta) { > > + if (delta < 0) { > > + res = false; > > + ambiguous_ecn_bytes_incr = true; > > + } > > + if (ecnfield != estimate_ecnfield) { > > + if (!first_changed) { > > + tp->est_ecnfield = ecnfield; > > + first_changed = true; > > + } else { > > + res = false; > > + ambiguous_ecn_bytes_incr = true; > > + } > > At least 2 indentation levels above the maximum readable. > > [...] 
> > @@ -4378,6 +4524,7 @@ void tcp_parse_options(const struct net *net, > > > > ptr = (const unsigned char *)(th + 1); > > opt_rx->saw_tstamp = 0; > > + opt_rx->accecn = 0; > > opt_rx->saw_unknown = 0; > > It would be good to be able to zero both 'accecn' and 'saw_unknown' with > a single statement. > > [...] > > @@ -766,6 +769,47 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > *ptr++ = htonl(opts->tsecr); > > } > > > > + if (OPTION_ACCECN & options) { > > + const u8 ect0_idx = INET_ECN_ECT_0 - 1; > > + const u8 ect1_idx = INET_ECN_ECT_1 - 1; > > + const u8 ce_idx = INET_ECN_CE - 1; > > + u32 e0b; > > + u32 e1b; > > + u32 ceb; > > + u8 len; > > + > > + e0b = opts->ecn_bytes[ect0_idx] + TCP_ACCECN_E0B_INIT_OFFSET; > > + e1b = opts->ecn_bytes[ect1_idx] + TCP_ACCECN_E1B_INIT_OFFSET; > > + ceb = opts->ecn_bytes[ce_idx] + TCP_ACCECN_CEB_INIT_OFFSET; > > + len = TCPOLEN_ACCECN_BASE + > > + opts->num_accecn_fields * TCPOLEN_ACCECN_PERFIELD; > > + > > + if (opts->num_accecn_fields == 2) { > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > + ((e1b >> 8) & 0xffff)); > > + *ptr++ = htonl(((e1b & 0xff) << 24) | > > + (ceb & 0xffffff)); > > + } else if (opts->num_accecn_fields == 1) { > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > + ((e1b >> 8) & 0xffff)); > > + leftover_bytes = ((e1b & 0xff) << 8) | > > + TCPOPT_NOP; > > + leftover_size = 1; > > + } else if (opts->num_accecn_fields == 0) { > > + leftover_bytes = (TCPOPT_ACCECN1 << 8) | len; > > + leftover_size = 2; > > + } else if (opts->num_accecn_fields == 3) { > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > + ((e1b >> 8) & 0xffff)); > > + *ptr++ = htonl(((e1b & 0xff) << 24) | > > + (ceb & 0xffffff)); > > + *ptr++ = htonl(((e0b & 0xffffff) << 8) | > > + TCPOPT_NOP); > > The above chunck and the contents of patch 7 must be in the same patch. > This split makes the review even harder. > > [...] 
> > @@ -1117,6 +1235,17 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb > > opts->num_sack_blocks = 0; > > } > > > > + if (tcp_ecn_mode_accecn(tp) && > > + sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { > > + int saving = opts->num_sack_blocks > 0 ? 2 : 0; > > + int remaining = MAX_TCP_OPTION_SPACE - size; > > AFACS the above means tcp_options_fit_accecn() must clear any already > set options, but apparently it does not do so. Have you tested with > something adding largish options like mptcp? This "fitting" for AccEcn option is not to make room for the option but to check if AccECN option fits and in what length, and how it can take advantage of some nop bytes when available to save option space.
> -----Original Message----- > From: Ilpo Järvinen <ij@kernel.org> > Sent: Tuesday, May 6, 2025 12:54 AM > To: Paolo Abeni <pabeni@redhat.com> > Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > Subject: Re: [PATCH v5 net-next 09/15] tcp: accecn: AccECN option > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > On Tue, 29 Apr 2025, Paolo Abeni wrote: > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > > @@ -302,10 +303,13 @@ struct tcp_sock { > > > u32 snd_up; /* Urgent pointer */ > > > u32 delivered; /* Total data packets delivered incl. rexmits */ > > > u32 delivered_ce; /* Like the above but only ECE marked packets */ > > > + u32 delivered_ecn_bytes[3]; > > > > This new fields do not belong to this cacheline group. I'm unsure they > > belong to fast-path at all. Also u32 will wrap-around very soon. > > > > [...] 
> > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > > index dc8fdc80e16b..74ac8a5d2e00 100644 > > > --- a/include/uapi/linux/tcp.h > > > +++ b/include/uapi/linux/tcp.h > > > @@ -298,6 +298,13 @@ struct tcp_info { > > > __u32 tcpi_snd_wnd; /* peer's advertised receive window after > > > * scaling (bytes) > > > */ > > > + __u32 tcpi_received_ce; /* # of CE marks received */ > > > + __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */ > > > + __u32 tcpi_delivered_e0_bytes; > > > + __u32 tcpi_delivered_ce_bytes; > > > + __u32 tcpi_received_e1_bytes; > > > + __u32 tcpi_received_e0_bytes; > > > + __u32 tcpi_received_ce_bytes; > > > > This will break uAPI: new fields must be addded at the end, or must > > fill existing holes. Also u32 set in stone in uAPI for a byte counter > > looks way too small. > > > > > @@ -5100,7 +5113,7 @@ static void __init tcp_struct_check(void) > > > /* 32bit arches with 8byte alignment on u64 fields might need padding > > > * before tcp_clock_cache. > > > */ > > > - CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 109 + 7); > > > + CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, > > > + tcp_sock_write_txrx, 122 + 6); > > > > The above means an additional cacheline in fast-path WRT the current > > status. IMHO should be avoided. 
> > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index > > > 5bd7fc9bcf66..41e45b9aff3f 100644 > > > --- a/net/ipv4/tcp_input.c > > > +++ b/net/ipv4/tcp_input.c > > > @@ -70,6 +70,7 @@ > > > #include <linux/sysctl.h> > > > #include <linux/kernel.h> > > > #include <linux/prefetch.h> > > > +#include <linux/bitops.h> > > > #include <net/dst.h> > > > #include <net/tcp.h> > > > #include <net/proto_memory.h> > > > @@ -499,6 +500,144 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr > > > return false; > > > } > > > > > > +/* Maps IP ECN field ECT/CE code point to AccECN option field > > > +number, given > > > + * we are sending fields with Accurate ECN Order 1: ECT(1), CE, ECT(0). > > > + */ > > > +static u8 tcp_ecnfield_to_accecn_optfield(u8 ecnfield) { > > > + switch (ecnfield) { > > > + case INET_ECN_NOT_ECT: > > > + return 0; /* AccECN does not send counts of NOT_ECT */ > > > + case INET_ECN_ECT_1: > > > + return 1; > > > + case INET_ECN_CE: > > > + return 2; > > > + case INET_ECN_ECT_0: > > > + return 3; > > > + default: > > > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); > > > > No WARN_ONCE() above please: either the 'ecnfield' data is masked vs > > INET_ECN_MASK and the WARN_ONCE should not be possible or a remote > > sender can deterministically trigger a WARN() which nowadays will in > > turn raise a CVE... > > > > [...] > > > +static u32 tcp_accecn_field_init_offset(u8 ecnfield) { > > > + switch (ecnfield) { > > > + case INET_ECN_NOT_ECT: > > > + return 0; /* AccECN does not send counts of NOT_ECT */ > > > + case INET_ECN_ECT_1: > > > + return TCP_ACCECN_E1B_INIT_OFFSET; > > > + case INET_ECN_CE: > > > + return TCP_ACCECN_CEB_INIT_OFFSET; > > > + case INET_ECN_ECT_0: > > > + return TCP_ACCECN_E0B_INIT_OFFSET; > > > + default: > > > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); > > > > Same as above. 
> > > > > + } > > > + return 0; > > > +} > > > + > > > +/* Maps AccECN option field #nr to IP ECN field ECT/CE bits */ > > > +static unsigned int tcp_accecn_optfield_to_ecnfield(unsigned int optfield, > > > + bool order) { > > > + u8 tmp; > > > + > > > + optfield = order ? 2 - optfield : optfield; > > > + tmp = optfield + 2; > > > + > > > + return (tmp + (tmp >> 2)) & INET_ECN_MASK; } > > > + > > > +/* Handles AccECN option ECT and CE 24-bit byte counters update > > > +into > > > + * the u32 value in tcp_sock. As we're processing TCP options, it > > > +is > > > + * safe to access from - 1. > > > + */ > > > +static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 > > > +init_offset) { > > > + u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & > > > + 0xFFFFFFU; > > > + u32 delta = (truncated - *cnt) & 0xFFFFFFU; > > > + > > > + /* If delta has the highest bit set (24th bit) indicating > > > + * negative, sign extend to correct an estimation using > > > + * sign_extend32(delta, 24 - 1) > > > + */ > > > + delta = sign_extend32(delta, 23); > > > + *cnt += delta; > > > + return (s32)delta; > > > +} > > > + > > > +/* Returns true if the byte counters can be used */ static bool > > > +tcp_accecn_process_option(struct tcp_sock *tp, > > > + const struct sk_buff *skb, > > > + u32 delivered_bytes, int flag) { > > > + u8 estimate_ecnfield = tp->est_ecnfield; > > > + bool ambiguous_ecn_bytes_incr = false; > > > + bool first_changed = false; > > > + unsigned int optlen; > > > + unsigned char *ptr; > > u8 would we more appropriate type for binary data. Hi Ilpo, Not sure I understand your point, could you elaborate which binary data you think shall use u8? 
> > > > + bool order1, res; > > > + unsigned int i; > > > + > > > + if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn) { > > > + if (estimate_ecnfield) { > > > + u8 ecnfield = estimate_ecnfield - 1; > > > + > > > + tp->delivered_ecn_bytes[ecnfield] += delivered_bytes; > > > + return true; > > > + } > > > + return false; > > > + } > > > + > > > + ptr = skb_transport_header(skb) + tp->rx_opt.accecn; > > > + optlen = ptr[1] - 2; > > > > This assumes optlen is greater then 2, but I don't see the relevant > > check. > > The options parser should check that, please see the "silly options" > check. > > > Are tcp options present at all? > > There is !tp->rx_opt.accecn check above which should ensure we're processing only AccECN Option that is present. > > > > + WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != > > > + TCPOPT_ACCECN1); > > > > Please, don't warn for arbitrary wrong data sent from the peer. > > If there isn't AccECN option at ptr, there's bug elsewhere in the code (in the option parse code). So this is an internal sanity check that > tp->rx_opt.accecn points to AccECN option for real like it should. > > If you still want that removed, no problem but it's should not be arbitrary data at this point because the options parsing code should have validated this condition already, thus WARN_ON_ONCE() seemed appropriate to me. Indeed, then I will keep this for next version, but can be adjust once receiving further feedback. > > > > + order1 = (ptr[0] == TCPOPT_ACCECN1); > > > + ptr += 2; > > > + > > > + res = !!estimate_ecnfield; > > > + for (i = 0; i < 3; i++) { > > > + if (optlen >= TCPOLEN_ACCECN_PERFIELD) { > > It's easy to reverse logic here and use continue, which buys one level of indentation. Sure, thanks for explicit suggestion, will do. 
Chia-Yu > > > > + u32 init_offset; > > > + u8 ecnfield; > > > + s32 delta; > > > + u32 *cnt; > > > + > > > + ecnfield = tcp_accecn_optfield_to_ecnfield(i, order1); > > > + init_offset = tcp_accecn_field_init_offset(ecnfield); > > > + cnt = &tp->delivered_ecn_bytes[ecnfield - 1]; > > > + delta = tcp_update_ecn_bytes(cnt, ptr, init_offset); > > > + if (delta) { > > > + if (delta < 0) { > > > + res = false; > > > + ambiguous_ecn_bytes_incr = true; > > > + } > > > + if (ecnfield != estimate_ecnfield) { > > > + if (!first_changed) { > > > + tp->est_ecnfield = ecnfield; > > > + first_changed = true; > > > + } else { > > > + res = false; > > > + ambiguous_ecn_bytes_incr = true; > > > + } > > > > At least 2 indentation levels above the maximum readable. > > > > [...] > > > @@ -4378,6 +4524,7 @@ void tcp_parse_options(const struct net *net, > > > > > > ptr = (const unsigned char *)(th + 1); > > > opt_rx->saw_tstamp = 0; > > > + opt_rx->accecn = 0; > > > opt_rx->saw_unknown = 0; > > > > It would be good to be able to zero both 'accecn' and 'saw_unknown' > > with a single statement. > > > > [...] 
> > > @@ -766,6 +769,47 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > > *ptr++ = htonl(opts->tsecr); > > > } > > > > > > + if (OPTION_ACCECN & options) { > > > + const u8 ect0_idx = INET_ECN_ECT_0 - 1; > > > + const u8 ect1_idx = INET_ECN_ECT_1 - 1; > > > + const u8 ce_idx = INET_ECN_CE - 1; > > > + u32 e0b; > > > + u32 e1b; > > > + u32 ceb; > > > + u8 len; > > > + > > > + e0b = opts->ecn_bytes[ect0_idx] + TCP_ACCECN_E0B_INIT_OFFSET; > > > + e1b = opts->ecn_bytes[ect1_idx] + TCP_ACCECN_E1B_INIT_OFFSET; > > > + ceb = opts->ecn_bytes[ce_idx] + TCP_ACCECN_CEB_INIT_OFFSET; > > > + len = TCPOLEN_ACCECN_BASE + > > > + opts->num_accecn_fields * TCPOLEN_ACCECN_PERFIELD; > > > + > > > + if (opts->num_accecn_fields == 2) { > > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > > + ((e1b >> 8) & 0xffff)); > > > + *ptr++ = htonl(((e1b & 0xff) << 24) | > > > + (ceb & 0xffffff)); > > > + } else if (opts->num_accecn_fields == 1) { > > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > > + ((e1b >> 8) & 0xffff)); > > > + leftover_bytes = ((e1b & 0xff) << 8) | > > > + TCPOPT_NOP; > > > + leftover_size = 1; > > > + } else if (opts->num_accecn_fields == 0) { > > > + leftover_bytes = (TCPOPT_ACCECN1 << 8) | len; > > > + leftover_size = 2; > > > + } else if (opts->num_accecn_fields == 3) { > > > + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | > > > + ((e1b >> 8) & 0xffff)); > > > + *ptr++ = htonl(((e1b & 0xff) << 24) | > > > + (ceb & 0xffffff)); > > > + *ptr++ = htonl(((e0b & 0xffffff) << 8) | > > > + TCPOPT_NOP); > > > > The above chunck and the contents of patch 7 must be in the same patch. > > This split makes the review even harder. > > > > [...] 
> > > @@ -1117,6 +1235,17 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb > > > opts->num_sack_blocks = 0; > > > } > > > > > > + if (tcp_ecn_mode_accecn(tp) && > > > + sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { > > > + int saving = opts->num_sack_blocks > 0 ? 2 : 0; > > > + int remaining = MAX_TCP_OPTION_SPACE - size; > > > > AFACS the above means tcp_options_fit_accecn() must clear any already > > set options, but apparently it does not do so. Have you tested with > > something adding largish options like mptcp? > > This "fitting" for AccEcn option is not to make room for the option but to check if AccECN option fits and in what length, and how it can take advantage of some nop bytes when available to save option space. > > -- > i.
On Tue, 6 May 2025, Chia-Yu Chang (Nokia) wrote: > > -----Original Message----- > > From: Ilpo Järvinen <ij@kernel.org> > > Sent: Tuesday, May 6, 2025 12:54 AM > > To: Paolo Abeni <pabeni@redhat.com> > > Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > > Subject: Re: [PATCH v5 net-next 09/15] tcp: accecn: AccECN option > > > > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > > > > > On Tue, 29 Apr 2025, Paolo Abeni wrote: > > > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > > > @@ -302,10 +303,13 @@ struct tcp_sock { > > > > u32 snd_up; /* Urgent pointer */ > > > > u32 delivered; /* Total data packets delivered incl. rexmits */ > > > > u32 delivered_ce; /* Like the above but only ECE marked packets */ > > > > + u32 delivered_ecn_bytes[3]; > > > > > > This new fields do not belong to this cacheline group. I'm unsure they > > > belong to fast-path at all. Also u32 will wrap-around very soon. > > > > > > [...] 
> > > > diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h > > > > index dc8fdc80e16b..74ac8a5d2e00 100644 > > > > --- a/include/uapi/linux/tcp.h > > > > +++ b/include/uapi/linux/tcp.h > > > > @@ -298,6 +298,13 @@ struct tcp_info { > > > > __u32 tcpi_snd_wnd; /* peer's advertised receive window after > > > > * scaling (bytes) > > > > */ > > > > + __u32 tcpi_received_ce; /* # of CE marks received */ > > > > + __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */ > > > > + __u32 tcpi_delivered_e0_bytes; > > > > + __u32 tcpi_delivered_ce_bytes; > > > > + __u32 tcpi_received_e1_bytes; > > > > + __u32 tcpi_received_e0_bytes; > > > > + __u32 tcpi_received_ce_bytes; > > > > > > This will break uAPI: new fields must be addded at the end, or must > > > fill existing holes. Also u32 set in stone in uAPI for a byte counter > > > looks way too small. > > > > > > > @@ -5100,7 +5113,7 @@ static void __init tcp_struct_check(void) > > > > /* 32bit arches with 8byte alignment on u64 fields might need padding > > > > * before tcp_clock_cache. > > > > */ > > > > - CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 109 + 7); > > > > + CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, > > > > + tcp_sock_write_txrx, 122 + 6); > > > > > > The above means an additional cacheline in fast-path WRT the current > > > status. IMHO should be avoided. 
> > > > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index > > > > 5bd7fc9bcf66..41e45b9aff3f 100644 > > > > --- a/net/ipv4/tcp_input.c > > > > +++ b/net/ipv4/tcp_input.c > > > > @@ -70,6 +70,7 @@ > > > > #include <linux/sysctl.h> > > > > #include <linux/kernel.h> > > > > #include <linux/prefetch.h> > > > > +#include <linux/bitops.h> > > > > #include <net/dst.h> > > > > #include <net/tcp.h> > > > > #include <net/proto_memory.h> > > > > @@ -499,6 +500,144 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr > > > > return false; > > > > } > > > > > > > > +/* Maps IP ECN field ECT/CE code point to AccECN option field > > > > +number, given > > > > + * we are sending fields with Accurate ECN Order 1: ECT(1), CE, ECT(0). > > > > + */ > > > > +static u8 tcp_ecnfield_to_accecn_optfield(u8 ecnfield) { > > > > + switch (ecnfield) { > > > > + case INET_ECN_NOT_ECT: > > > > + return 0; /* AccECN does not send counts of NOT_ECT */ > > > > + case INET_ECN_ECT_1: > > > > + return 1; > > > > + case INET_ECN_CE: > > > > + return 2; > > > > + case INET_ECN_ECT_0: > > > > + return 3; > > > > + default: > > > > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); > > > > > > No WARN_ONCE() above please: either the 'ecnfield' data is masked vs > > > INET_ECN_MASK and the WARN_ONCE should not be possible or a remote > > > sender can deterministically trigger a WARN() which nowadays will in > > > turn raise a CVE... > > > > > > [...] 
> > > > +static u32 tcp_accecn_field_init_offset(u8 ecnfield) { > > > > + switch (ecnfield) { > > > > + case INET_ECN_NOT_ECT: > > > > + return 0; /* AccECN does not send counts of NOT_ECT */ > > > > + case INET_ECN_ECT_1: > > > > + return TCP_ACCECN_E1B_INIT_OFFSET; > > > > + case INET_ECN_CE: > > > > + return TCP_ACCECN_CEB_INIT_OFFSET; > > > > + case INET_ECN_ECT_0: > > > > + return TCP_ACCECN_E0B_INIT_OFFSET; > > > > + default: > > > > + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); > > > > > > Same as above. > > > > > > > + } > > > > + return 0; > > > > +} > > > > + > > > > +/* Maps AccECN option field #nr to IP ECN field ECT/CE bits */ > > > > +static unsigned int tcp_accecn_optfield_to_ecnfield(unsigned int optfield, > > > > + bool order) { > > > > + u8 tmp; > > > > + > > > > + optfield = order ? 2 - optfield : optfield; > > > > + tmp = optfield + 2; > > > > + > > > > + return (tmp + (tmp >> 2)) & INET_ECN_MASK; } > > > > + > > > > +/* Handles AccECN option ECT and CE 24-bit byte counters update > > > > +into > > > > + * the u32 value in tcp_sock. As we're processing TCP options, it > > > > +is > > > > + * safe to access from - 1. 
> > > > + */ > > > > +static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 > > > > +init_offset) { > > > > + u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & > > > > + 0xFFFFFFU; > > > > + u32 delta = (truncated - *cnt) & 0xFFFFFFU; > > > > + > > > > + /* If delta has the highest bit set (24th bit) indicating > > > > + * negative, sign extend to correct an estimation using > > > > + * sign_extend32(delta, 24 - 1) > > > > + */ > > > > + delta = sign_extend32(delta, 23); > > > > + *cnt += delta; > > > > + return (s32)delta; > > > > +} > > > > + > > > > +/* Returns true if the byte counters can be used */ static bool > > > > +tcp_accecn_process_option(struct tcp_sock *tp, > > > > + const struct sk_buff *skb, > > > > + u32 delivered_bytes, int flag) { > > > > + u8 estimate_ecnfield = tp->est_ecnfield; > > > > + bool ambiguous_ecn_bytes_incr = false; > > > > + bool first_changed = false; > > > > + unsigned int optlen; > > > > + unsigned char *ptr; > > > > u8 would we more appropriate type for binary data. > > Hi Ilpo, > > Not sure I understand your point, could you elaborate which binary data > you think shall use u8? The header/option is binary data so u8 seems the right type for it. So: u8 *ptr; -- i.
On Tue, 6 May 2025, Ilpo Järvinen wrote: > On Tue, 29 Apr 2025, Paolo Abeni wrote: > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > > @@ -1117,6 +1235,17 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb > > > opts->num_sack_blocks = 0; > > > } > > > > > > + if (tcp_ecn_mode_accecn(tp) && > > > + sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { > > > + int saving = opts->num_sack_blocks > 0 ? 2 : 0; > > > + int remaining = MAX_TCP_OPTION_SPACE - size; > > > > AFACS the above means tcp_options_fit_accecn() must clear any already > > set options, but apparently it does not do so. Have you tested with > > something adding largish options like mptcp? > > This "fitting" for AccEcn option is not to make room for the option but to > check if AccECN option fits and in what length, and how it can take > advantage of some nop bytes when available to save option space. A minor correction. SACK blocks will naturally fill the entire option space if there are enough holes which would "starve" AccECN from using option space during loss recovery. Thus, AccECN option is allowed to grab some of that space from SACK. There's redundancy in SACK blocks anyway so it shouldn't usually impact SACK signal much.
diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 9cbfefd693e3..0e032d9631ac 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -122,8 +122,9 @@ struct tcp_options_received { smc_ok : 1, /* SMC seen on SYN packet */ snd_wscale : 4, /* Window scaling received from sender */ rcv_wscale : 4; /* Window scaling to send to receiver */ - u8 saw_unknown:1, /* Received unknown option */ - unused:7; + u8 accecn:6, /* AccECN index in header, 0=no options */ + saw_unknown:1, /* Received unknown option */ + unused:1; u8 num_sacks; /* Number of SACK blocks */ u16 user_mss; /* mss requested by user in ioctl */ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */ @@ -302,10 +303,13 @@ struct tcp_sock { u32 snd_up; /* Urgent pointer */ u32 delivered; /* Total data packets delivered incl. rexmits */ u32 delivered_ce; /* Like the above but only ECE marked packets */ + u32 delivered_ecn_bytes[3]; u32 received_ce; /* Like the above but for rcvd CE marked pkts */ u32 received_ecn_bytes[3]; u8 received_ce_pending:4, /* Not yet transmit cnt of received_ce */ unused2:4; + u8 accecn_minlen:2,/* Minimum length of AccECN option sent */ + est_ecnfield:2;/* ECN field for AccECN delivered estimates */ u32 app_limited; /* limited until "delivered" reaches this val */ u32 rcv_wnd; /* Current receiver window */ /* diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 6373e3f17da8..4569a9ef4fb8 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -148,6 +148,7 @@ struct netns_ipv4 { struct local_ports ip_local_ports; u8 sysctl_tcp_ecn; + u8 sysctl_tcp_ecn_option; u8 sysctl_tcp_ecn_fallback; u8 sysctl_ip_default_ttl; diff --git a/include/net/tcp.h b/include/net/tcp.h index 6ffa4ae085db..bfff2a9f95bf 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -204,6 +204,8 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX); #define TCPOPT_AO 29 /* Authentication Option (RFC5925) */ #define TCPOPT_MPTCP 30 /* Multipath TCP (RFC6824) */ 
#define TCPOPT_FASTOPEN 34 /* Fast open (RFC7413) */ +#define TCPOPT_ACCECN0 172 /* 0xAC: Accurate ECN Order 0 */ +#define TCPOPT_ACCECN1 174 /* 0xAE: Accurate ECN Order 1 */ #define TCPOPT_EXP 254 /* Experimental */ /* Magic number to be after the option value for sharing TCP * experimental options. See draft-ietf-tcpm-experimental-options-00.txt @@ -221,6 +223,7 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX); #define TCPOLEN_TIMESTAMP 10 #define TCPOLEN_MD5SIG 18 #define TCPOLEN_FASTOPEN_BASE 2 +#define TCPOLEN_ACCECN_BASE 2 #define TCPOLEN_EXP_FASTOPEN_BASE 4 #define TCPOLEN_EXP_SMC_BASE 6 @@ -234,6 +237,13 @@ static_assert((1 << ATO_BITS) > TCP_DELACK_MAX); #define TCPOLEN_MD5SIG_ALIGNED 20 #define TCPOLEN_MSS_ALIGNED 4 #define TCPOLEN_EXP_SMC_BASE_ALIGNED 8 +#define TCPOLEN_ACCECN_PERFIELD 3 + +/* Maximum number of byte counters in AccECN option + size */ +#define TCP_ACCECN_NUMFIELDS 3 +#define TCP_ACCECN_MAXSIZE (TCPOLEN_ACCECN_BASE + \ + TCPOLEN_ACCECN_PERFIELD * \ + TCP_ACCECN_NUMFIELDS) /* tp->accecn_fail_mode */ #define TCP_ACCECN_ACE_FAIL_SEND BIT(0) @@ -1056,6 +1066,9 @@ static inline void tcp_accecn_init_counters(struct tcp_sock *tp) tp->received_ce = 0; tp->received_ce_pending = 0; __tcp_accecn_init_bytes_counters(tp->received_ecn_bytes); + __tcp_accecn_init_bytes_counters(tp->delivered_ecn_bytes); + tp->accecn_minlen = 0; + tp->est_ecnfield = 0; } /* State flags for sacked in struct tcp_skb_cb */ diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h index dc8fdc80e16b..74ac8a5d2e00 100644 --- a/include/uapi/linux/tcp.h +++ b/include/uapi/linux/tcp.h @@ -298,6 +298,13 @@ struct tcp_info { __u32 tcpi_snd_wnd; /* peer's advertised receive window after * scaling (bytes) */ + __u32 tcpi_received_ce; /* # of CE marks received */ + __u32 tcpi_delivered_e1_bytes; /* Accurate ECN byte counters */ + __u32 tcpi_delivered_e0_bytes; + __u32 tcpi_delivered_ce_bytes; + __u32 tcpi_received_e1_bytes; + __u32 tcpi_received_e0_bytes; + __u32 
tcpi_received_ce_bytes; __u32 tcpi_rcv_wnd; /* local advertised receive window after * scaling (bytes) */ diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index 75ec1a599b52..1d7fd86ca7b9 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -731,6 +731,15 @@ static struct ctl_table ipv4_net_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = &tcp_ecn_mode_max, }, + { + .procname = "tcp_ecn_option", + .data = &init_net.ipv4.sysctl_tcp_ecn_option, + .maxlen = sizeof(u8), + .mode = 0644, + .proc_handler = proc_dou8vec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = SYSCTL_TWO, + }, { .procname = "tcp_ecn_fallback", .data = &init_net.ipv4.sysctl_tcp_ecn_fallback, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 1e21bdf43f23..89799f73c451 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -270,6 +270,7 @@ #include <net/icmp.h> #include <net/inet_common.h> +#include <net/inet_ecn.h> #include <net/tcp.h> #include <net/mptcp.h> #include <net/proto_memory.h> @@ -4109,6 +4110,9 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info) { const struct tcp_sock *tp = tcp_sk(sk); /* iff sk_type == SOCK_STREAM */ const struct inet_connection_sock *icsk = inet_csk(sk); + const u8 ect1_idx = INET_ECN_ECT_1 - 1; + const u8 ect0_idx = INET_ECN_ECT_0 - 1; + const u8 ce_idx = INET_ECN_CE - 1; unsigned long rate; u32 now; u64 rate64; @@ -4227,6 +4231,14 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info) info->tcpi_rehash = tp->plb_rehash + tp->timeout_rehash; info->tcpi_fastopen_client_fail = tp->fastopen_client_fail; + info->tcpi_received_ce = tp->received_ce; + info->tcpi_delivered_e1_bytes = tp->delivered_ecn_bytes[ect1_idx]; + info->tcpi_delivered_e0_bytes = tp->delivered_ecn_bytes[ect0_idx]; + info->tcpi_delivered_ce_bytes = tp->delivered_ecn_bytes[ce_idx]; + info->tcpi_received_e1_bytes = tp->received_ecn_bytes[ect1_idx]; + info->tcpi_received_e0_bytes = tp->received_ecn_bytes[ect0_idx]; + info->tcpi_received_ce_bytes = 
tp->received_ecn_bytes[ce_idx]; + info->tcpi_total_rto = tp->total_rto; info->tcpi_total_rto_recoveries = tp->total_rto_recoveries; info->tcpi_total_rto_time = tp->total_rto_time; @@ -5091,6 +5103,7 @@ static void __init tcp_struct_check(void) CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, snd_up); CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered); CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ce); + CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, delivered_ecn_bytes); CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ce); CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, received_ecn_bytes); CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_txrx, app_limited); @@ -5100,7 +5113,7 @@ static void __init tcp_struct_check(void) /* 32bit arches with 8byte alignment on u64 fields might need padding * before tcp_clock_cache. */ - CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 109 + 7); + CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 122 + 6); /* RX read-write hotpath cache lines */ CACHELINE_ASSERT_GROUP_MEMBER(struct tcp_sock, tcp_sock_write_rx, bytes_received); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5bd7fc9bcf66..41e45b9aff3f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -70,6 +70,7 @@ #include <linux/sysctl.h> #include <linux/kernel.h> #include <linux/prefetch.h> +#include <linux/bitops.h> #include <net/dst.h> #include <net/tcp.h> #include <net/proto_memory.h> @@ -499,6 +500,144 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr return false; } +/* Maps IP ECN field ECT/CE code point to AccECN option field number, given + * we are sending fields with Accurate ECN Order 1: ECT(1), CE, ECT(0). 
+ */ +static u8 tcp_ecnfield_to_accecn_optfield(u8 ecnfield) +{ + switch (ecnfield) { + case INET_ECN_NOT_ECT: + return 0; /* AccECN does not send counts of NOT_ECT */ + case INET_ECN_ECT_1: + return 1; + case INET_ECN_CE: + return 2; + case INET_ECN_ECT_0: + return 3; + default: + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); + } + return 0; +} + +/* Maps IP ECN field ECT/CE code point to AccECN option field value offset. + * Some fields do not start from zero, to detect zeroing by middleboxes. + */ +static u32 tcp_accecn_field_init_offset(u8 ecnfield) +{ + switch (ecnfield) { + case INET_ECN_NOT_ECT: + return 0; /* AccECN does not send counts of NOT_ECT */ + case INET_ECN_ECT_1: + return TCP_ACCECN_E1B_INIT_OFFSET; + case INET_ECN_CE: + return TCP_ACCECN_CEB_INIT_OFFSET; + case INET_ECN_ECT_0: + return TCP_ACCECN_E0B_INIT_OFFSET; + default: + WARN_ONCE(1, "bad ECN code point: %d\n", ecnfield); + } + return 0; +} + +/* Maps AccECN option field #nr to IP ECN field ECT/CE bits */ +static unsigned int tcp_accecn_optfield_to_ecnfield(unsigned int optfield, + bool order) +{ + u8 tmp; + + optfield = order ? 2 - optfield : optfield; + tmp = optfield + 2; + + return (tmp + (tmp >> 2)) & INET_ECN_MASK; +} + +/* Handles AccECN option ECT and CE 24-bit byte counters update into + * the u32 value in tcp_sock. As we're processing TCP options, it is + * safe to access from - 1. 
+ */ +static s32 tcp_update_ecn_bytes(u32 *cnt, const char *from, u32 init_offset) +{ + u32 truncated = (get_unaligned_be32(from - 1) - init_offset) & + 0xFFFFFFU; + u32 delta = (truncated - *cnt) & 0xFFFFFFU; + + /* If delta has the highest bit set (24th bit) indicating + * negative, sign extend to correct an estimation using + * sign_extend32(delta, 24 - 1) + */ + delta = sign_extend32(delta, 23); + *cnt += delta; + return (s32)delta; +} + +/* Returns true if the byte counters can be used */ +static bool tcp_accecn_process_option(struct tcp_sock *tp, + const struct sk_buff *skb, + u32 delivered_bytes, int flag) +{ + u8 estimate_ecnfield = tp->est_ecnfield; + bool ambiguous_ecn_bytes_incr = false; + bool first_changed = false; + unsigned int optlen; + unsigned char *ptr; + bool order1, res; + unsigned int i; + + if (!(flag & FLAG_SLOWPATH) || !tp->rx_opt.accecn) { + if (estimate_ecnfield) { + u8 ecnfield = estimate_ecnfield - 1; + + tp->delivered_ecn_bytes[ecnfield] += delivered_bytes; + return true; + } + return false; + } + + ptr = skb_transport_header(skb) + tp->rx_opt.accecn; + optlen = ptr[1] - 2; + WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1); + order1 = (ptr[0] == TCPOPT_ACCECN1); + ptr += 2; + + res = !!estimate_ecnfield; + for (i = 0; i < 3; i++) { + if (optlen >= TCPOLEN_ACCECN_PERFIELD) { + u32 init_offset; + u8 ecnfield; + s32 delta; + u32 *cnt; + + ecnfield = tcp_accecn_optfield_to_ecnfield(i, order1); + init_offset = tcp_accecn_field_init_offset(ecnfield); + cnt = &tp->delivered_ecn_bytes[ecnfield - 1]; + delta = tcp_update_ecn_bytes(cnt, ptr, init_offset); + if (delta) { + if (delta < 0) { + res = false; + ambiguous_ecn_bytes_incr = true; + } + if (ecnfield != estimate_ecnfield) { + if (!first_changed) { + tp->est_ecnfield = ecnfield; + first_changed = true; + } else { + res = false; + ambiguous_ecn_bytes_incr = true; + } + } + } + + optlen -= TCPOLEN_ACCECN_PERFIELD; + ptr += TCPOLEN_ACCECN_PERFIELD; + } + } + if 
(ambiguous_ecn_bytes_incr) + tp->est_ecnfield = 0; + + return res; +} + static void tcp_count_delivered_ce(struct tcp_sock *tp, u32 ecn_count) { tp->delivered_ce += ecn_count; @@ -515,7 +654,8 @@ static void tcp_count_delivered(struct tcp_sock *tp, u32 delivered, /* Returns the ECN CE delta */ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb, - u32 delivered_pkts, int flag) + u32 delivered_pkts, u32 delivered_bytes, + int flag) { const struct tcphdr *th = tcp_hdr(skb); struct tcp_sock *tp = tcp_sk(sk); @@ -526,6 +666,8 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb, if (!(flag & (FLAG_FORWARD_PROGRESS | FLAG_TS_PROGRESS))) return 0; + tcp_accecn_process_option(tp, skb, delivered_bytes, flag); + if (!(flag & FLAG_SLOWPATH)) { /* AccECN counter might overflow on large ACKs */ if (delivered_pkts <= TCP_ACCECN_CEP_ACE_MASK) @@ -551,12 +693,14 @@ static u32 __tcp_accecn_process(struct sock *sk, const struct sk_buff *skb, } static u32 tcp_accecn_process(struct sock *sk, const struct sk_buff *skb, - u32 delivered_pkts, int *flag) + u32 delivered_pkts, u32 delivered_bytes, + int *flag) { struct tcp_sock *tp = tcp_sk(sk); u32 delta; - delta = __tcp_accecn_process(sk, skb, delivered_pkts, *flag); + delta = __tcp_accecn_process(sk, skb, delivered_pkts, + delivered_bytes, *flag); if (delta > 0) { tcp_count_delivered_ce(tp, delta); *flag |= FLAG_ECE; @@ -4212,6 +4356,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if (tcp_ecn_mode_accecn(tp)) ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered, + sack_state.delivered_bytes, &flag); tcp_in_ack_event(sk, flag); @@ -4251,6 +4396,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if (tcp_ecn_mode_accecn(tp)) ecn_count = tcp_accecn_process(sk, skb, tp->delivered - delivered, + sack_state.delivered_bytes, &flag); tcp_in_ack_event(sk, flag); /* If data was DSACKed, see if we can undo a cwnd reduction. 
*/ @@ -4378,6 +4524,7 @@ void tcp_parse_options(const struct net *net, ptr = (const unsigned char *)(th + 1); opt_rx->saw_tstamp = 0; + opt_rx->accecn = 0; opt_rx->saw_unknown = 0; while (length > 0) { @@ -4469,6 +4616,12 @@ void tcp_parse_options(const struct net *net, ptr, th->syn, foc, false); break; + case TCPOPT_ACCECN0: + case TCPOPT_ACCECN1: + /* Save offset of AccECN option in TCP header */ + opt_rx->accecn = (ptr - 2) - (__u8 *)th; + break; + case TCPOPT_EXP: /* Fast Open option shares code 254 using a * 16 bits magic number. @@ -4529,11 +4682,14 @@ static bool tcp_fast_parse_options(const struct net *net, */ if (th->doff == (sizeof(*th) / 4)) { tp->rx_opt.saw_tstamp = 0; + tp->rx_opt.accecn = 0; return false; } else if (tp->rx_opt.tstamp_ok && th->doff == ((sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) / 4)) { - if (tcp_parse_aligned_timestamp(tp, th)) + if (tcp_parse_aligned_timestamp(tp, th)) { + tp->rx_opt.accecn = 0; return true; + } } tcp_parse_options(net, skb, &tp->rx_opt, 1, NULL); @@ -6133,8 +6289,12 @@ void tcp_ecn_received_counters(struct sock *sk, const struct sk_buff *skb, tp->received_ce_pending = min(tp->received_ce_pending + pcount, 0xfU); - if (payload_len > 0) + if (payload_len > 0) { + u8 minlen = tcp_ecnfield_to_accecn_optfield(ecnfield); tp->received_ecn_bytes[ecnfield - 1] += payload_len; + tp->accecn_minlen = max_t(u8, tp->accecn_minlen, + minlen); + } } } @@ -6358,6 +6518,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb) */ tp->rx_opt.saw_tstamp = 0; + tp->rx_opt.accecn = 0; /* pred_flags is 0xS?10 << 16 + snd_wnd * if header_prediction is to be made diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 5c5d4b94b59c..3f3e285fc973 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -3450,6 +3450,7 @@ static void __net_init tcp_set_hashinfo(struct net *net) static int __net_init tcp_sk_init(struct net *net) { net->ipv4.sysctl_tcp_ecn = 2; + net->ipv4.sysctl_tcp_ecn_option = 2; 
net->ipv4.sysctl_tcp_ecn_fallback = 1; net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index ad97bb9951fd..a36de6c539da 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -491,6 +491,7 @@ static inline bool tcp_urg_mode(const struct tcp_sock *tp) #define OPTION_SMC BIT(9) #define OPTION_MPTCP BIT(10) #define OPTION_AO BIT(11) +#define OPTION_ACCECN BIT(12) static void smc_options_write(__be32 *ptr, u16 *options) { @@ -512,12 +513,14 @@ struct tcp_out_options { u16 mss; /* 0 to disable */ u8 ws; /* window scale, 0 to disable */ u8 num_sack_blocks; /* number of SACK blocks to include */ + u8 num_accecn_fields; /* number of AccECN fields needed */ u8 hash_size; /* bytes in hash_location */ u8 bpf_opt_len; /* length of BPF hdr option */ __u8 *hash_location; /* temporary pointer, overloaded */ __u32 tsval, tsecr; /* need to include OPTION_TS */ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */ struct mptcp_out_options mptcp; + u32 *ecn_bytes; /* AccECN ECT/CE byte counters */ }; static void mptcp_options_write(struct tcphdr *th, __be32 *ptr, @@ -766,6 +769,47 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, *ptr++ = htonl(opts->tsecr); } + if (OPTION_ACCECN & options) { + const u8 ect0_idx = INET_ECN_ECT_0 - 1; + const u8 ect1_idx = INET_ECN_ECT_1 - 1; + const u8 ce_idx = INET_ECN_CE - 1; + u32 e0b; + u32 e1b; + u32 ceb; + u8 len; + + e0b = opts->ecn_bytes[ect0_idx] + TCP_ACCECN_E0B_INIT_OFFSET; + e1b = opts->ecn_bytes[ect1_idx] + TCP_ACCECN_E1B_INIT_OFFSET; + ceb = opts->ecn_bytes[ce_idx] + TCP_ACCECN_CEB_INIT_OFFSET; + len = TCPOLEN_ACCECN_BASE + + opts->num_accecn_fields * TCPOLEN_ACCECN_PERFIELD; + + if (opts->num_accecn_fields == 2) { + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | + ((e1b >> 8) & 0xffff)); + *ptr++ = htonl(((e1b & 0xff) << 24) | + (ceb & 0xffffff)); + } else if (opts->num_accecn_fields == 1) { + *ptr++ = htonl((TCPOPT_ACCECN1 
<< 24) | (len << 16) | + ((e1b >> 8) & 0xffff)); + leftover_bytes = ((e1b & 0xff) << 8) | + TCPOPT_NOP; + leftover_size = 1; + } else if (opts->num_accecn_fields == 0) { + leftover_bytes = (TCPOPT_ACCECN1 << 8) | len; + leftover_size = 2; + } else if (opts->num_accecn_fields == 3) { + *ptr++ = htonl((TCPOPT_ACCECN1 << 24) | (len << 16) | + ((e1b >> 8) & 0xffff)); + *ptr++ = htonl(((e1b & 0xff) << 24) | + (ceb & 0xffffff)); + *ptr++ = htonl(((e0b & 0xffffff) << 8) | + TCPOPT_NOP); + } + if (tp) + tp->accecn_minlen = 0; + } + if (unlikely(OPTION_SACK_ADVERTISE & options)) { *ptr++ = htonl((leftover_bytes << 16) | (TCPOPT_SACK_PERM << 8) | @@ -886,6 +930,60 @@ static void mptcp_set_option_cond(const struct request_sock *req, } } +/* Initial values for AccECN option; the order is based on ECN field bits, + * similar to received_ecn_bytes. Used for SYN/ACK AccECN option. + */ +static u32 synack_ecn_bytes[3] = { 0, 0, 0 }; + +static u32 tcp_synack_options_combine_saving(struct tcp_out_options *opts) +{ + /* How much there's room for combining with the alignment padding? */ + if ((opts->options & (OPTION_SACK_ADVERTISE | OPTION_TS)) == + OPTION_SACK_ADVERTISE) + return 2; + else if (opts->options & OPTION_WSCALE) + return 1; + return 0; +} + +/* Calculates how large an AccECN option will fit into @remaining option space. + * + * AccECN option can sometimes replace NOPs used for alignment of other + * TCP options (up to @max_combine_saving available). + * + * Only solutions with at least @required AccECN fields are accepted. + * + * Returns: The size of the AccECN option excluding space repurposed from + * the alignment of the other options.
+ */ +static int tcp_options_fit_accecn(struct tcp_out_options *opts, int required, + int remaining, int max_combine_saving) +{ + int size = TCP_ACCECN_MAXSIZE; + + opts->num_accecn_fields = TCP_ACCECN_NUMFIELDS; + + while (opts->num_accecn_fields >= required) { + int leftover_size = size & 0x3; + /* Pad to dword if cannot combine */ + if (leftover_size > max_combine_saving) + leftover_size = -((4 - leftover_size) & 0x3); + + if (remaining >= size - leftover_size) { + size -= leftover_size; + break; + } + + opts->num_accecn_fields--; + size -= TCPOLEN_ACCECN_PERFIELD; + } + if (opts->num_accecn_fields < required) + return 0; + + opts->options |= OPTION_ACCECN; + return size; +} + /* Compute TCP options for SYN packets. This is not the final * network wire format yet. */ @@ -968,6 +1066,17 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, } } + /* Simultaneous open SYN/ACK needs AccECN option but not SYN */ + if (unlikely((TCP_SKB_CB(skb)->tcp_flags & TCPHDR_ACK) && + tcp_ecn_mode_accecn(tp) && + sock_net(sk)->ipv4.sysctl_tcp_ecn_option && + remaining >= TCPOLEN_ACCECN_BASE)) { + u32 saving = tcp_synack_options_combine_saving(opts); + + opts->ecn_bytes = synack_ecn_bytes; + remaining -= tcp_options_fit_accecn(opts, 0, remaining, saving); + } + bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining); return MAX_TCP_OPTION_SPACE - remaining; @@ -985,6 +1094,7 @@ static unsigned int tcp_synack_options(const struct sock *sk, { struct inet_request_sock *ireq = inet_rsk(req); unsigned int remaining = MAX_TCP_OPTION_SPACE; + struct tcp_request_sock *treq = tcp_rsk(req); if (tcp_key_is_md5(key)) { opts->options |= OPTION_MD5; @@ -1047,6 +1157,14 @@ static unsigned int tcp_synack_options(const struct sock *sk, smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining); + if (treq->accecn_ok && sock_net(sk)->ipv4.sysctl_tcp_ecn_option && + remaining >= TCPOLEN_ACCECN_BASE) { + u32 saving = tcp_synack_options_combine_saving(opts); + + 
opts->ecn_bytes = synack_ecn_bytes; + remaining -= tcp_options_fit_accecn(opts, 0, remaining, saving); + } + bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb, synack_type, opts, &remaining); @@ -1117,6 +1235,17 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb opts->num_sack_blocks = 0; } + if (tcp_ecn_mode_accecn(tp) && + sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { + int saving = opts->num_sack_blocks > 0 ? 2 : 0; + int remaining = MAX_TCP_OPTION_SPACE - size; + + opts->ecn_bytes = tp->received_ecn_bytes; + size += tcp_options_fit_accecn(opts, tp->accecn_minlen, + remaining, + saving); + } + if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) { unsigned int remaining = MAX_TCP_OPTION_SPACE - size;