Message ID | 20250422153602.54787-1-chia-yu.chang@nokia-bell-labs.com |
---|---|
Headers | show |
Series | AccECN protocol patch series | expand |
On Tue, Apr 22, 2025 at 05:35:47PM +0200, chia-yu.chang@nokia-bell-labs.com wrote: > From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > > Hello, > > Plese find the v5: > > v5 (22-Apr-2025) > - Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms@kernel.org>) > > v4 (18-Apr-2025) > - Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms@kernel.org>) Thanks, I confirm that v6 appears to be clear wrt these build checks for 32-bit ARM assertion for alignment requirements. ...
On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > @@ -555,6 +556,30 @@ static void smc_check_reset_syn_req(const struct tcp_sock *oldtp, > #endif > } > > +u8 tcp_accecn_option_init(const struct sk_buff *skb, u8 opt_offset) > +{ > + unsigned char *ptr = skb_transport_header(skb) + opt_offset; > + unsigned int optlen = ptr[1] - 2; > + > + WARN_ON_ONCE(ptr[0] != TCPOPT_ACCECN0 && ptr[0] != TCPOPT_ACCECN1); This warn shoul be dropped, too. /P
> -----Original Message----- > From: Paolo Abeni <pabeni@redhat.com> > Sent: Tuesday, April 29, 2025 12:14 PM > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > Cc: Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com> > Subject: Re: [PATCH v5 net-next 03/15] tcp: AccECN core > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > @@ -298,6 +298,9 @@ struct tcp_sock { > > u32 snd_up; /* Urgent pointer */ > > u32 delivered; /* Total data packets delivered incl. rexmits */ > > u32 delivered_ce; /* Like the above but only ECE marked packets */ > > + u32 received_ce; /* Like the above but for rcvd CE marked pkts */ > > + u8 received_ce_pending:4, /* Not yet transmit cnt of received_ce */ > > + unused2:4; > > AFAICS this uses a 4 bytes hole present prior to this patch after "rcv_wnd", leaving a 3 bytes hole after 'unused2'. Possibly should be worth mentioning the hole presence. > > @Eric: would it make sense use this hole for 'noneagle'/'rate_app_limited' and shrink the 'tcp_sock_write_txrx' group a bit? Hi Paolo, Thanks for the feedback and sorry for my late response. I can either mention it here or move the places. However, as the following patches will continue change holes, so maybe I mention the hole change per patch make it more understandable. If this is find for you, then I will make revision in the next version. > > [...] > > @@ -5095,7 +5097,7 @@ static void __init tcp_struct_check(void) > > /* 32bit arches with 8byte alignment on u64 fields might need padding > > * before tcp_clock_cache. > > */ > > - CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 92 + 4); > > + CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, > > + tcp_sock_write_txrx, 97 + 7); > > Really? I *think* the change here should not move the cacheline end around, due to holes. Could you please include the relevant pahole > (trimmed) output prior to this patch and after in the commit message? > Here is pahole output before and after this patch. Indeed, it creates 3 bytes hole after 'unused2' so it shall add (5+3)=8 to the original 92 + 4. Finally, it will be 92 + 4 + (5 + 3) = 97 + 7. *BEFORE this patch* __u8 __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2585 0 */ /* XXX 3 bytes hole, try to pack */ __be32 pred_flags; /* 2588 4 */ u64 tcp_clock_cache; /* 2592 8 */ u64 tcp_mstamp; /* 2600 8 */ u32 rcv_nxt; /* 2608 4 */ u32 snd_nxt; /* 2612 4 */ u32 snd_una; /* 2616 4 */ u32 window_clamp; /* 2620 4 */ /* --- cacheline 41 boundary (2624 bytes) --- */ u32 srtt_us; /* 2624 4 */ u32 packets_out; /* 2628 4 */ u32 snd_up; /* 2632 4 */ u32 delivered; /* 2636 4 */ u32 delivered_ce; /* 2640 4 */ u32 app_limited; /* 2644 4 */ u32 rcv_wnd; /* 2648 4 */ struct tcp_options_received rx_opt; /* 2652 24 */ u8 nonagle:4; /* 2676: 0 1 */ u8 rate_app_limited:1; /* 2676: 4 1 */ /* XXX 3 bits hole, try to pack */ __u8 __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2677 0 */ /* XXX 3 bytes hole, try to pack */ *AFTER this patch* __u8 __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2585 0 */ /* XXX 3 bytes hole, try to pack */ __be32 pred_flags; /* 2588 4 */ u64 tcp_clock_cache; /* 2592 8 */ u64 tcp_mstamp; /* 2600 8 */ u32 rcv_nxt; /* 2608 4 */ u32 snd_nxt; /* 2612 4 */ u32 snd_una; /* 2616 4 */ u32 window_clamp; /* 2620 4 */ /* --- cacheline 41 boundary (2624 bytes) --- */ u32 srtt_us; /* 2624 4 */ u32 packets_out; /* 2628 4 */ u32 snd_up; /* 2632 4 */ u32 delivered; /* 2636 4 */ u32 delivered_ce; /* 2640 4 */ u32 received_ce; /* 2644 4 */ u8 received_ce_pending:4; /* 2648: 0 1 */ u8 unused2:4; /* 2648: 4 1 */ /* XXX 3 bytes hole, try to pack */ u32 app_limited; /* 2652 4 */ u32 rcv_wnd; /* 2656 4 */ struct tcp_options_received rx_opt; /* 2660 24 */ u8 nonagle:4; /* 2684: 0 1 */ u8 rate_app_limited:1; /* 2684: 4 1 */ /* XXX 3 bits hole, try to pack */ __u8 __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2685 0 */ > [...] > > @@ -384,17 +387,16 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb) > > if (tcp_ca_needs_ecn(sk)) > > tcp_ca_event(sk, CA_EVENT_ECN_IS_CE); > > > > - if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) { > > + if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR) && > > + tcp_ecn_mode_rfc3168(tp)) { > > /* Better not delay acks, sender can have a very low cwnd */ > > tcp_enter_quickack_mode(sk, 2); > > tp->ecn_flags |= TCP_ECN_DEMAND_CWR; > > } > > - tp->ecn_flags |= TCP_ECN_SEEN; > > At this point is not entirely clear to me why the removal of the above line is needed/correct. What I see is to move the place to set this flag from here to tcp_ecn_received_counters(). Also, this function will work when receiving data by calling tcp_event_data_recv(). While tcp_ecn_received_counters() takes effect in more places (e.g., either len <= tcp_header_len or NOT) to ensure ACE counter tracks all segments including pure ACKs. > [...] > > @@ -4056,6 +4118,11 @@ static int tcp_ack(struct sock *sk, const > > struct sk_buff *skb, int flag) > > > > tcp_rack_update_reo_wnd(sk, &rs); > > > > + if (tcp_ecn_mode_accecn(tp)) > > + ecn_count = tcp_accecn_process(sk, skb, > > + tp->delivered - delivered, > > + &flag); > > AFAICS the above could set FLAG_ECE in flags, menaning the previous > tcp_clean_rtx_queue() will run with such flag cleared and the later function checking such flag will not. I wondering if this inconsistency could cause problems? This flag set by tcp_accecn_process() will be used by following functions: tcp_in_ack_event(), tcp_fastretrans_alert(). And this shall only impact the AccECN mode. Best regards, Chia-Yu > > /P
> -----Original Message----- > From: Paolo Abeni <pabeni@redhat.com> > Sent: Tuesday, April 29, 2025 2:10 PM > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > Subject: Re: [PATCH v5 net-next 10/15] tcp: accecn: AccECN option send control > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > From: Ilpo Järvinen <ij@kernel.org> > > > > Instead of sending the option in every ACK, limit sending to those > > ACKs where the option is necessary: > > - Handshake > > - "Change-triggered ACK" + the ACK following it. The > > 2nd ACK is necessary to unambiguously indicate which > > of the ECN byte counters in increasing. The first > > ACK has two counters increasing due to the ecnfield > > edge. > > - ACKs with CE to allow CEP delta validations to take > > advantage of the option. > > - Force option to be sent every at least once per 2^22 > > bytes. The check is done using the bit edges of the > > byte counters (avoids need for extra variables). > > - AccECN option beacon to send a few times per RTT even if > > nothing in the ECN state requires that. The default is 3 > > times per RTT, and its period can be set via > > sysctl_tcp_ecn_option_beacon. > > > > Signed-off-by: Ilpo Järvinen <ij@kernel.org> > > Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > > --- > > include/linux/tcp.h | 3 +++ > > include/net/netns/ipv4.h | 1 + > > include/net/tcp.h | 1 + > > net/ipv4/sysctl_net_ipv4.c | 9 ++++++++ > > net/ipv4/tcp.c | 5 ++++- > > net/ipv4/tcp_input.c | 36 +++++++++++++++++++++++++++++++- > > net/ipv4/tcp_ipv4.c | 1 + > > net/ipv4/tcp_minisocks.c | 2 ++ > > net/ipv4/tcp_output.c | 42 ++++++++++++++++++++++++++++++-------- > > 9 files changed, 90 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/tcp.h b/include/linux/tcp.h index > > 0e032d9631ac..acb0727855f8 100644 > > --- a/include/linux/tcp.h > > +++ b/include/linux/tcp.h > > @@ -309,8 +309,11 @@ struct tcp_sock { > > u8 received_ce_pending:4, /* Not yet transmit cnt of received_ce */ > > unused2:4; > > u8 accecn_minlen:2,/* Minimum length of AccECN option sent */ > > + prev_ecnfield:2,/* ECN bits from the previous segment */ > > + accecn_opt_demand:2,/* Demand AccECN option for n next > > + ACKs */ > > est_ecnfield:2;/* ECN field for AccECN delivered estimates */ > > u32 app_limited; /* limited until "delivered" reaches this val */ > > + u64 accecn_opt_tstamp; /* Last AccECN option sent timestamp */ > > AFAICS this field is only access in the tx path, while this chunk belong to the tcp_sock_write_txrx group. > > > @@ -740,6 +740,15 @@ static struct ctl_table ipv4_net_table[] = { > > .extra1 = SYSCTL_ZERO, > > .extra2 = SYSCTL_TWO, > > }, > > + { > > + .procname = "tcp_ecn_option_beacon", > > + .data = &init_net.ipv4.sysctl_tcp_ecn_option_beacon, > > + .maxlen = sizeof(u8), > > + .mode = 0644, > > + .proc_handler = proc_dou8vec_minmax, > > + .extra1 = SYSCTL_ZERO, > > + .extra2 = SYSCTL_FOUR, > > The number of new sysctl is concerning high, and I don't see any documentation update yet. Hi Paolo, The documentation is expected to be at the end of whole AccECN patch https://github.com/L4STeam/linux-net-next/commit/03dcec1aec6aa774da4c1993b38a5b937040a11c Or I can move it next to this patch. > > > @@ -6291,9 +6294,36 @@ void tcp_ecn_received_counters(struct sock *sk, > > const struct sk_buff *skb, > > > > if (payload_len > 0) { > > u8 minlen = > > tcp_ecnfield_to_accecn_optfield(ecnfield); > > + u32 oldbytes = tp->received_ecn_bytes[ecnfield - > > + 1]; > > + > > tp->received_ecn_bytes[ecnfield - 1] += payload_len; > > tp->accecn_minlen = max_t(u8, tp->accecn_minlen, > > minlen); > > + > > + /* Demand AccECN option at least every 2^22 bytes to > > + * avoid overflowing the ECN byte counters. > > + */ > > + if ((tp->received_ecn_bytes[ecnfield - 1] ^ oldbytes) & > > + ~((1 << 22) - 1)) { > > + u8 opt_demand = max_t(u8, 1, > > + > > + tp->accecn_opt_demand); > > + > > + tp->accecn_opt_demand = opt_demand; > > + } > > I guess this explains the u32 values for such counters. Some comments in the previous patch could be useful. Yes, like my previous email says, I would refer to the algorithm in AccECN draft. > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index > > 3f3e285fc973..2e95dad66fe3 100644 > > --- a/net/ipv4/tcp_ipv4.c > > +++ b/net/ipv4/tcp_ipv4.c > > @@ -3451,6 +3451,7 @@ static int __net_init tcp_sk_init(struct net > > *net) { > > net->ipv4.sysctl_tcp_ecn = 2; > > net->ipv4.sysctl_tcp_ecn_option = 2; > > + net->ipv4.sysctl_tcp_ecn_option_beacon = 3; > > net->ipv4.sysctl_tcp_ecn_fallback = 1; > > Human readable macros instead of magic numbers could help. OK, commments will be added here. > > > @@ -1237,13 +1253,18 @@ static unsigned int > > tcp_established_options(struct sock *sk, struct sk_buff *skb > > > > if (tcp_ecn_mode_accecn(tp) && > > sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { > > - int saving = opts->num_sack_blocks > 0 ? 2 : 0; > > - int remaining = MAX_TCP_OPTION_SPACE - size; > > - > > - opts->ecn_bytes = tp->received_ecn_bytes; > > - size += tcp_options_fit_accecn(opts, tp->accecn_minlen, > > - remaining, > > - saving); > > + if (sock_net(sk)->ipv4.sysctl_tcp_ecn_option >= 2 || > > + tp->accecn_opt_demand || > > + tcp_accecn_option_beacon_check(sk)) { > > Why a nested if here and just not expanding the existing one? Sure, will merge them. Chia-Yu > > /P
On Tue, 29 Apr 2025, Paolo Abeni wrote: > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > @@ -709,6 +709,8 @@ static __be32 *process_tcp_ao_options(struct tcp_sock *tp, > > return ptr; > > } > > > > +#define NOP_LEFTOVER ((TCPOPT_NOP << 8) | TCPOPT_NOP) > > + > > /* Write previously computed TCP options to the packet. > > * > > * Beware: Something in the Internet is very sensitive to the ordering of > > @@ -727,8 +729,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > struct tcp_out_options *opts, > > struct tcp_key *key) > > { > > + u16 leftover_bytes = NOP_LEFTOVER; /* replace next NOPs if avail */ > > __be32 *ptr = (__be32 *)(th + 1); > > u16 options = opts->options; /* mungable copy */ > > + int leftover_size = 2; > > > > if (tcp_key_is_md5(key)) { > > *ptr++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) | > > @@ -763,17 +767,22 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > } > > > > if (unlikely(OPTION_SACK_ADVERTISE & options)) { > > - *ptr++ = htonl((TCPOPT_NOP << 24) | > > - (TCPOPT_NOP << 16) | > > + *ptr++ = htonl((leftover_bytes << 16) | > > (TCPOPT_SACK_PERM << 8) | > > TCPOLEN_SACK_PERM); > > + leftover_bytes = NOP_LEFTOVER; > > Why? isn't leftover_bytes already == NOP_LEFTOVER? > > > } > > > > if (unlikely(OPTION_WSCALE & options)) { > > - *ptr++ = htonl((TCPOPT_NOP << 24) | > > + u8 highbyte = TCPOPT_NOP; > > + > > + if (unlikely(leftover_size == 1)) > > How can the above conditional be true? > > > + highbyte = leftover_bytes >> 8; > > + *ptr++ = htonl((highbyte << 24) | > > (TCPOPT_WINDOW << 16) | > > (TCPOLEN_WINDOW << 8) | > > opts->ws); > > + leftover_bytes = NOP_LEFTOVER; > > Why? isn't leftover_bytes already == NOP_LEFTOVER? > > > } > > > > if (unlikely(opts->num_sack_blocks)) { > > @@ -781,8 +790,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > tp->duplicate_sack : tp->selective_acks; > > int this_sack; > > > > - *ptr++ = htonl((TCPOPT_NOP << 24) | > > - (TCPOPT_NOP << 16) | > > + *ptr++ = htonl((leftover_bytes << 16) | > > (TCPOPT_SACK << 8) | > > (TCPOLEN_SACK_BASE + (opts->num_sack_blocks * > > TCPOLEN_SACK_PERBLOCK))); > > @@ -794,6 +802,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > } > > > > tp->rx_opt.dsack = 0; > > + } else if (unlikely(leftover_bytes != NOP_LEFTOVER)) { > > I really feel like I'm missing some code chunk, but I don't see any > possible value for leftover_bytes other than NOP_LEFTOVER Hi, I split it this way to keep the generic part of the leftover mechanism in own patch separate from AccECN option change itself (as you did later discover). This is among the most convoluted parts in the entire AccECN patch series so it seemed wise to split it as small as I could. Having those non-sensical looking assigns here were to avoid code churn in the latter patch. The changelog could have stated that clearly though (my fault from years back).
On Mon, 5 May 2025, Chia-Yu Chang (Nokia) wrote: > > -----Original Message----- > > From: Paolo Abeni <pabeni@redhat.com> > > Sent: Tuesday, April 29, 2025 2:10 PM > > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > > Subject: Re: [PATCH v5 net-next 10/15] tcp: accecn: AccECN option send control > > > > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > > > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > > From: Ilpo Järvinen <ij@kernel.org> > > > > > > Instead of sending the option in every ACK, limit sending to those > > > ACKs where the option is necessary: > > > - Handshake > > > - "Change-triggered ACK" + the ACK following it. The > > > 2nd ACK is necessary to unambiguously indicate which > > > of the ECN byte counters in increasing. The first > > > ACK has two counters increasing due to the ecnfield > > > edge. > > > - ACKs with CE to allow CEP delta validations to take > > > advantage of the option. > > > - Force option to be sent every at least once per 2^22 > > > bytes. The check is done using the bit edges of the > > > byte counters (avoids need for extra variables). > > > - AccECN option beacon to send a few times per RTT even if > > > nothing in the ECN state requires that. The default is 3 > > > times per RTT, and its period can be set via > > > sysctl_tcp_ecn_option_beacon. > > > > > > Signed-off-by: Ilpo Järvinen <ij@kernel.org> > > > Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > > > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > > > --- > > > include/linux/tcp.h | 3 +++ > > > include/net/netns/ipv4.h | 1 + > > > include/net/tcp.h | 1 + > > > net/ipv4/sysctl_net_ipv4.c | 9 ++++++++ > > > net/ipv4/tcp.c | 5 ++++- > > > net/ipv4/tcp_input.c | 36 +++++++++++++++++++++++++++++++- > > > net/ipv4/tcp_ipv4.c | 1 + > > > net/ipv4/tcp_minisocks.c | 2 ++ > > > net/ipv4/tcp_output.c | 42 ++++++++++++++++++++++++++++++-------- > > > 9 files changed, 90 insertions(+), 10 deletions(-) > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index > > > 3f3e285fc973..2e95dad66fe3 100644 > > > --- a/net/ipv4/tcp_ipv4.c > > > +++ b/net/ipv4/tcp_ipv4.c > > > @@ -3451,6 +3451,7 @@ static int __net_init tcp_sk_init(struct net > > > *net) { > > > net->ipv4.sysctl_tcp_ecn = 2; > > > net->ipv4.sysctl_tcp_ecn_option = 2; > > > + net->ipv4.sysctl_tcp_ecn_option_beacon = 3; > > > net->ipv4.sysctl_tcp_ecn_fallback = 1; > > > > Human readable macros instead of magic numbers could help. > > OK, commments will be added here. Hi, Using named defines to replace literals would be more useful than comments (names can be grepped for, do not fall out-of-sync with code, etc.). > > > @@ -1237,13 +1253,18 @@ static unsigned int > > > tcp_established_options(struct sock *sk, struct sk_buff *skb > > > > > > if (tcp_ecn_mode_accecn(tp) && > > > sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { > > > - int saving = opts->num_sack_blocks > 0 ? 2 : 0; > > > - int remaining = MAX_TCP_OPTION_SPACE - size; > > > - > > > - opts->ecn_bytes = tp->received_ecn_bytes; > > > - size += tcp_options_fit_accecn(opts, tp->accecn_minlen, > > > - remaining, > > > - saving); > > > + if (sock_net(sk)->ipv4.sysctl_tcp_ecn_option >= 2 || > > > + tp->accecn_opt_demand || > > > + tcp_accecn_option_beacon_check(sk)) { > > > > Why a nested if here and just not expanding the existing one? > > Sure, will merge them. While I don't remember everything that well anymore, this might have been to reduce code churn in some later patch, so it might be worth to check it first (that patch might even fall outside of this series now that these are split into multiple chunks).
> -----Original Message----- > From: Ilpo Järvinen <ij@kernel.org> > Sent: Tuesday, May 6, 2025 1:10 AM > To: Paolo Abeni <pabeni@redhat.com> > Cc: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > Subject: Re: [PATCH v5 net-next 07/15] tcp: allow embedding leftover into option padding > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > On Tue, 29 Apr 2025, Paolo Abeni wrote: > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > > @@ -709,6 +709,8 @@ static __be32 *process_tcp_ao_options(struct tcp_sock *tp, > > > return ptr; > > > } > > > > > > +#define NOP_LEFTOVER ((TCPOPT_NOP << 8) | TCPOPT_NOP) > > > + > > > /* Write previously computed TCP options to the packet. > > > * > > > * Beware: Something in the Internet is very sensitive to the > > > ordering of @@ -727,8 +729,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > > struct tcp_out_options *opts, > > > struct tcp_key *key) { > > > + u16 leftover_bytes = NOP_LEFTOVER; /* replace next NOPs if avail */ > > > __be32 *ptr = (__be32 *)(th + 1); > > > u16 options = opts->options; /* mungable copy */ > > > + int leftover_size = 2; > > > > > > if (tcp_key_is_md5(key)) { > > > *ptr++ = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) | > > > @@ -763,17 +767,22 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > > } > > > > > > if (unlikely(OPTION_SACK_ADVERTISE & options)) { > > > - *ptr++ = htonl((TCPOPT_NOP << 24) | > > > - (TCPOPT_NOP << 16) | > > > + *ptr++ = htonl((leftover_bytes << 16) | > > > (TCPOPT_SACK_PERM << 8) | > > > TCPOLEN_SACK_PERM); > > > + leftover_bytes = NOP_LEFTOVER; > > > > Why? isn't leftover_bytes already == NOP_LEFTOVER? > > > > > } > > > > > > if (unlikely(OPTION_WSCALE & options)) { > > > - *ptr++ = htonl((TCPOPT_NOP << 24) | > > > + u8 highbyte = TCPOPT_NOP; > > > + > > > + if (unlikely(leftover_size == 1)) > > > > How can the above conditional be true? > > > > > + highbyte = leftover_bytes >> 8; > > > + *ptr++ = htonl((highbyte << 24) | > > > (TCPOPT_WINDOW << 16) | > > > (TCPOLEN_WINDOW << 8) | > > > opts->ws); > > > + leftover_bytes = NOP_LEFTOVER; > > > > Why? isn't leftover_bytes already == NOP_LEFTOVER? > > > > > } > > > > > > if (unlikely(opts->num_sack_blocks)) { @@ -781,8 +790,7 @@ > > > static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > > tp->duplicate_sack : tp->selective_acks; > > > int this_sack; > > > > > > - *ptr++ = htonl((TCPOPT_NOP << 24) | > > > - (TCPOPT_NOP << 16) | > > > + *ptr++ = htonl((leftover_bytes << 16) | > > > (TCPOPT_SACK << 8) | > > > (TCPOLEN_SACK_BASE + (opts->num_sack_blocks * > > > > > > TCPOLEN_SACK_PERBLOCK))); @@ -794,6 +802,10 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, > > > } > > > > > > tp->rx_opt.dsack = 0; > > > + } else if (unlikely(leftover_bytes != NOP_LEFTOVER)) { > > > > I really feel like I'm missing some code chunk, but I don't see any > > possible value for leftover_bytes other than NOP_LEFTOVER > > Hi, > > I split it this way to keep the generic part of the leftover mechanism in own patch separate from AccECN option change itself (as you did later discover). This is among the most convoluted parts in the entire AccECN patch series so it seemed wise to split it as small as I could. Having those non-sensical looking assigns here were to avoid code churn in the latter patch. The changelog could have stated that clearly though (my fault from years back). > > > -- > i. Hi Ilpo, Thanks for further clarifications, and I will add more comments in this patch. Chia-Yu
> -----Original Message----- > From: Ilpo Järvinen <ij@kernel.org> > Sent: Tuesday, May 6, 2025 1:27 AM > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com> > Cc: Paolo Abeni <pabeni@redhat.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > Subject: RE: [PATCH v5 net-next 10/15] tcp: accecn: AccECN option send control > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > On Mon, 5 May 2025, Chia-Yu Chang (Nokia) wrote: > > > > -----Original Message----- > > > From: Paolo Abeni <pabeni@redhat.com> > > > Sent: Tuesday, April 29, 2025 2:10 PM > > > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; > > > horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; > > > bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; > > > jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; > > > xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; > > > edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; > > > ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; > > > linux-kselftest@vger.kernel.org; ij@kernel.org; > > > ncardwell@google.com; Koen De Schepper (Nokia) > > > <koen.de_schepper@nokia-bell-labs.com>; g.white > > > <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; > > > mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; > > > Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > > > Subject: Re: [PATCH v5 net-next 10/15] tcp: accecn: AccECN option > > > send control > > > > > > > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > > > > > > > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > > > From: Ilpo Järvinen <ij@kernel.org> > > > > > > > > Instead of sending the option in every ACK, limit sending to those > > > > ACKs where the option is necessary: > > > > - Handshake > > > > - "Change-triggered ACK" + the ACK following it. The > > > > 2nd ACK is necessary to unambiguously indicate which > > > > of the ECN byte counters in increasing. The first > > > > ACK has two counters increasing due to the ecnfield > > > > edge. > > > > - ACKs with CE to allow CEP delta validations to take > > > > advantage of the option. > > > > - Force option to be sent every at least once per 2^22 > > > > bytes. The check is done using the bit edges of the > > > > byte counters (avoids need for extra variables). > > > > - AccECN option beacon to send a few times per RTT even if > > > > nothing in the ECN state requires that. The default is 3 > > > > times per RTT, and its period can be set via > > > > sysctl_tcp_ecn_option_beacon. > > > > > > > > Signed-off-by: Ilpo Järvinen <ij@kernel.org> > > > > Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > > > > Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> > > > > --- > > > > include/linux/tcp.h | 3 +++ > > > > include/net/netns/ipv4.h | 1 + > > > > include/net/tcp.h | 1 + > > > > net/ipv4/sysctl_net_ipv4.c | 9 ++++++++ > > > > net/ipv4/tcp.c | 5 ++++- > > > > net/ipv4/tcp_input.c | 36 +++++++++++++++++++++++++++++++- > > > > net/ipv4/tcp_ipv4.c | 1 + > > > > net/ipv4/tcp_minisocks.c | 2 ++ > > > > net/ipv4/tcp_output.c | 42 ++++++++++++++++++++++++++++++-------- > > > > 9 files changed, 90 insertions(+), 10 deletions(-) > > > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index > > > > 3f3e285fc973..2e95dad66fe3 100644 > > > > --- a/net/ipv4/tcp_ipv4.c > > > > +++ b/net/ipv4/tcp_ipv4.c > > > > @@ -3451,6 +3451,7 @@ static int __net_init tcp_sk_init(struct net > > > > *net) { > > > > net->ipv4.sysctl_tcp_ecn = 2; > > > > net->ipv4.sysctl_tcp_ecn_option = 2; > > > > + net->ipv4.sysctl_tcp_ecn_option_beacon = 3; > > > > net->ipv4.sysctl_tcp_ecn_fallback = 1; > > > > > > Human readable macros instead of magic numbers could help. > > > > OK, commments will be added here. > > Hi, > > Using named defines to replace literals would be more useful than comments (names can be grepped for, do not fall out-of-sync with code, etc.). > > > > > @@ -1237,13 +1253,18 @@ static unsigned int > > > > tcp_established_options(struct sock *sk, struct sk_buff *skb > > > > > > > > if (tcp_ecn_mode_accecn(tp) && > > > > sock_net(sk)->ipv4.sysctl_tcp_ecn_option) { > > > > - int saving = opts->num_sack_blocks > 0 ? 2 : 0; > > > > - int remaining = MAX_TCP_OPTION_SPACE - size; > > > > - > > > > - opts->ecn_bytes = tp->received_ecn_bytes; > > > > - size += tcp_options_fit_accecn(opts, tp->accecn_minlen, > > > > - remaining, > > > > - saving); > > > > + if (sock_net(sk)->ipv4.sysctl_tcp_ecn_option >= 2 || > > > > + tp->accecn_opt_demand || > > > > + tcp_accecn_option_beacon_check(sk)) { > > > > > > Why a nested if here and just not expanding the existing one? > > > > Sure, will merge them. > > While I don't remember everything that well anymore, this might have been to reduce code churn in some later patch, so it might be worth to check it first (that patch might even fall outside of this series now that these are split into multiple chunks). > > -- > i. Hi Ilpo, Thanks for this raised point. And I've checked that the condition is changed at the latter patches ("tcp: accecn: AccECN option failure handling"), but with the similar nested if. So, I would try to merge it at the next version, and will check packetdrill for sure. Chia-Yu
> -----Original Message----- > From: Chia-Yu Chang (Nokia) > Sent: Monday, May 5, 2025 5:25 PM > To: Paolo Abeni <pabeni@redhat.com>; horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > Cc: Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com> > Subject: RE: [PATCH v5 net-next 03/15] tcp: AccECN core > > > -----Original Message----- > > From: Paolo Abeni <pabeni@redhat.com> > > Sent: Tuesday, April 29, 2025 12:14 PM > > To: Chia-Yu Chang (Nokia) <chia-yu.chang@nokia-bell-labs.com>; > > horms@kernel.org; dsahern@kernel.org; kuniyu@amazon.com; > > bpf@vger.kernel.org; netdev@vger.kernel.org; dave.taht@gmail.com; > > jhs@mojatatu.com; kuba@kernel.org; stephen@networkplumber.org; > > xiyou.wangcong@gmail.com; jiri@resnulli.us; davem@davemloft.net; > > edumazet@google.com; andrew+netdev@lunn.ch; donald.hunter@gmail.com; > > ast@fiberby.net; liuhangbin@gmail.com; shuah@kernel.org; > > linux-kselftest@vger.kernel.org; ij@kernel.org; ncardwell@google.com; > > Koen De Schepper (Nokia) <koen.de_schepper@nokia-bell-labs.com>; > > g.white <g.white@cablelabs.com>; ingemar.s.johansson@ericsson.com; > > mirja.kuehlewind@ericsson.com; cheshire@apple.com; rs.ietf@gmx.at; > > Jason_Livingood@comcast.com; vidhi_goel <vidhi_goel@apple.com> > > Cc: Olivier Tilmans (Nokia) <olivier.tilmans@nokia.com> > > Subject: Re: [PATCH v5 net-next 03/15] tcp: AccECN core > > > > > > CAUTION: This is an external email. Please be very careful when clicking links or opening attachments. See the URL nok.it/ext for additional information. > > > > > > > > On 4/22/25 5:35 PM, chia-yu.chang@nokia-bell-labs.com wrote: > > > @@ -298,6 +298,9 @@ struct tcp_sock { > > > u32 snd_up; /* Urgent pointer */ > > > u32 delivered; /* Total data packets delivered incl. rexmits */ > > > u32 delivered_ce; /* Like the above but only ECE marked packets */ > > > + u32 received_ce; /* Like the above but for rcvd CE marked pkts */ > > > + u8 received_ce_pending:4, /* Not yet transmit cnt of received_ce */ > > > + unused2:4; > > > > AFAICS this uses a 4 bytes hole present prior to this patch after "rcv_wnd", leaving a 3 bytes hole after 'unused2'. Possibly should be worth mentioning the hole presence. > > > > @Eric: would it make sense use this hole for 'noneagle'/'rate_app_limited' and shrink the 'tcp_sock_write_txrx' group a bit? Hi, By moving noneagle/rate_app_limited in the beginning of this group, I manage to reuse the 3-byte hole in the beginning of __cacheline_group_begin__tcp_sock_write_txrx. Thus, I will include it in the next version, and you can find the pahole results below: /*BEFORE this patch*/ __u8 __cacheline_group_end__tcp_sock_write_tx[0]; /* 2585 0 */ __u8 __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2585 0 */ /* XXX 3 bytes hole, try to pack */ __be32 pred_flags; /* 2588 4 */ u64 tcp_clock_cache; /* 2592 8 */ u64 tcp_mstamp; /* 2600 8 */ u32 rcv_nxt; /* 2608 4 */ u32 snd_nxt; /* 2612 4 */ u32 snd_una; /* 2616 4 */ u32 window_clamp; /* 2620 4 */ /* --- cacheline 41 boundary (2624 bytes) --- */ u32 srtt_us; /* 2624 4 */ u32 packets_out; /* 2628 4 */ u32 snd_up; /* 2632 4 */ u32 delivered; /* 2636 4 */ u32 delivered_ce; /* 2640 4 */ u32 app_limited; /* 2644 4 */ u32 rcv_wnd; /* 2648 4 */ struct tcp_options_received rx_opt; /* 2652 24 */ u8 nonagle:4; /* 2676: 0 1 */ u8 rate_app_limited:1; /* 2676: 4 1 */ /* XXX 3 bits hole, try to pack */ __u8 __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2677 0 */ /* XXX 3 bytes hole, try to pack */ __u8 __cacheline_group_begin__tcp_sock_write_rx[0] __attribute__((__aligned__(8))); /* 2680 0 */ /*AFTER this patch*/ __u8 __cacheline_group_end__tcp_sock_write_tx[0]; /* 2585 0 */ __u8 __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2585 0 */ u8 nonagle:4; /* 2585: 0 1 */ u8 rate_app_limited:1; /* 2585: 4 1 */ /* XXX 3 bits hole, try to pack */ /* Force alignment to the next boundary: */ u8 :0; u8 received_ce_pending:4; /* 2586: 0 1 */ u8 unused2:4; /* 2586: 4 1 */ /* XXX 1 byte hole, try to pack */ __be32 pred_flags; /* 2588 4 */ u64 tcp_clock_cache; /* 2592 8 */ u64 tcp_mstamp; /* 2600 8 */ u32 rcv_nxt; /* 2608 4 */ u32 snd_nxt; /* 2612 4 */ u32 snd_una; /* 2616 4 */ u32 window_clamp; /* 2620 4 */ /* --- cacheline 41 boundary (2624 bytes) --- */ u32 srtt_us; /* 2624 4 */ u32 packets_out; /* 2628 4 */ u32 snd_up; /* 2632 4 */ u32 delivered; /* 2636 4 */ u32 delivered_ce; /* 2640 4 */ u32 received_ce; /* 2644 4 */ u32 app_limited; /* 2648 4 */ u32 rcv_wnd; /* 2652 4 */ struct tcp_options_received rx_opt; /* 2656 24 */ __u8 __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2680 0 */ __u8 __cacheline_group_begin__tcp_sock_write_rx[0] __attribute__((__aligned__(8))); /* 2680 0 */ > > Hi Paolo, > > Thanks for the feedback and sorry for my late response. > I can either mention it here or move the places. > However, as the following patches will continue change holes, so maybe I mention the hole change per patch make it more understandable. > If this is find for you, then I will make revision in the next version. > > > > > [...] > > > @@ -5095,7 +5097,7 @@ static void __init tcp_struct_check(void) > > > /* 32bit arches with 8byte alignment on u64 fields might need padding > > > * before tcp_clock_cache. > > > */ > > > - CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, tcp_sock_write_txrx, 92 + 4); > > > + CACHELINE_ASSERT_GROUP_SIZE(struct tcp_sock, > > > + tcp_sock_write_txrx, 97 + 7); > > > > Really? I *think* the change here should not move the cacheline end > > around, due to holes. Could you please include the relevant pahole > > (trimmed) output prior to this patch and after in the commit message? > > > > Here is pahole output before and after this patch. > Indeed, it creates 3 bytes hole after 'unused2' so it shall add (5+3)=8 to the original 92 + 4. > Finally, it will be 92 + 4 + (5 + 3) = 97 + 7. > > *BEFORE this patch* > __u8 __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2585 0 */ > > /* XXX 3 bytes hole, try to pack */ > > __be32 pred_flags; /* 2588 4 */ > u64 tcp_clock_cache; /* 2592 8 */ > u64 tcp_mstamp; /* 2600 8 */ > u32 rcv_nxt; /* 2608 4 */ > u32 snd_nxt; /* 2612 4 */ > u32 snd_una; /* 2616 4 */ > u32 window_clamp; /* 2620 4 */ > /* --- cacheline 41 boundary (2624 bytes) --- */ > u32 srtt_us; /* 2624 4 */ > u32 packets_out; /* 2628 4 */ > u32 snd_up; /* 2632 4 */ > u32 delivered; /* 2636 4 */ > u32 delivered_ce; /* 2640 4 */ > u32 app_limited; /* 2644 4 */ > u32 rcv_wnd; /* 2648 4 */ > struct tcp_options_received rx_opt; /* 2652 24 */ > u8 nonagle:4; /* 2676: 0 1 */ > u8 rate_app_limited:1; /* 2676: 4 1 */ > > /* XXX 3 bits hole, try to pack */ > > __u8 __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2677 0 */ > > /* XXX 3 bytes hole, try to pack */ > > *AFTER this patch* > __u8 __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2585 0 */ > > /* XXX 3 bytes hole, try to pack */ > > __be32 pred_flags; /* 2588 4 */ > u64 tcp_clock_cache; /* 2592 8 */ > u64 tcp_mstamp; /* 2600 8 */ > u32 rcv_nxt; /* 2608 4 */ > u32 snd_nxt; /* 2612 4 */ > u32 snd_una; /* 2616 4 */ > u32 window_clamp; /* 2620 4 */ > /* --- cacheline 41 boundary (2624 bytes) --- */ > u32 srtt_us; /* 2624 4 */ > u32 packets_out; /* 2628 4 */ > u32 snd_up; /* 2632 4 */ > u32 delivered; /* 2636 4 */ > u32 delivered_ce; /* 2640 4 */ > u32 received_ce; /* 2644 4 */ > u8 received_ce_pending:4; /* 2648: 0 1 */ > u8 unused2:4; /* 2648: 4 1 */ > > /* XXX 3 bytes hole, try to pack */ > > u32 app_limited; /* 2652 4 */ > u32 rcv_wnd; /* 2656 4 */ > struct tcp_options_received rx_opt; /* 2660 24 */ > u8 nonagle:4; /* 2684: 0 1 */ > u8 rate_app_limited:1; /* 2684: 4 1 */ > > /* XXX 3 bits hole, try to pack */ > > __u8 __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2685 0 */ > > > > [...] > > > @@ -384,17 +387,16 @@ static void tcp_data_ecn_check(struct sock *sk, const struct sk_buff *skb) > > > if (tcp_ca_needs_ecn(sk)) > > > tcp_ca_event(sk, CA_EVENT_ECN_IS_CE); > > > > > > - if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) { > > > + if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR) && > > > + tcp_ecn_mode_rfc3168(tp)) { > > > /* Better not delay acks, sender can have a very low cwnd */ > > > tcp_enter_quickack_mode(sk, 2); > > > tp->ecn_flags |= TCP_ECN_DEMAND_CWR; > > > } > > > - tp->ecn_flags |= TCP_ECN_SEEN; > > > > At this point is not entirely clear to me why the removal of the above line is needed/correct. > > What I see is to move the place to set this flag from here to tcp_ecn_received_counters(). > > Also, this function will work when receiving data by calling tcp_event_data_recv(). > > While tcp_ecn_received_counters() takes effect in more places (e.g., either len <= tcp_header_len or NOT) to ensure ACE counter tracks all segments including pure ACKs. > > > [...] > > > @@ -4056,6 +4118,11 @@ static int tcp_ack(struct sock *sk, const > > > struct sk_buff *skb, int flag) > > > > > > tcp_rack_update_reo_wnd(sk, &rs); > > > > > > + if (tcp_ecn_mode_accecn(tp)) > > > + ecn_count = tcp_accecn_process(sk, skb, > > > + tp->delivered - delivered, > > > + &flag); > > > > AFAICS the above could set FLAG_ECE in flags, menaning the previous > > tcp_clean_rtx_queue() will run with such flag cleared and the later function checking such flag will not. I wondering if this inconsistency could cause problems? > > This flag set by tcp_accecn_process() will be used by following functions: tcp_in_ack_event(), tcp_fastretrans_alert(). > > And this shall only impact the AccECN mode. > > Best regards, > Chia-Yu > > > > > /P
From: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Hello, Plese find the v5: v5 (22-Apr-2025) - Further fix for 32-bit ARM alignment in tcp.c (Simon Horman <horms@kernel.org>) v4 (18-Apr-2025) - Fix 32-bit ARM assertion for alignment requirement (Simon Horman <horms@kernel.org>) v3 (14-Apr-2025) - Fix patch apply issue in v2 (Jakub Kicinski <kuba@kernel.org>) v2 (18-Mar-2025) - Add one missing patch from previous AccECN protocol preparation patch series to this patch series The full patch series can be found in https://github.com/L4STeam/linux-net-next/commits/upstream_l4steam/ The Accurate ECN draft can be found in https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-accurate-ecn-28 Best regards, Chia-Yu Chia-Yu Chang (1): tcp: accecn: AccECN option failure handling Ilpo Järvinen (14): tcp: reorganize SYN ECN code tcp: fast path functions later tcp: AccECN core tcp: accecn: AccECN negotiation tcp: accecn: add AccECN rx byte counters tcp: accecn: AccECN needs to know delivered bytes tcp: allow embedding leftover into option padding tcp: sack option handling improvements tcp: accecn: AccECN option tcp: accecn: AccECN option send control tcp: accecn: AccECN option ceb/cep heuristic tcp: accecn: AccECN ACE field multi-wrap heuristic tcp: accecn: try to fit AccECN option with SACK tcp: try to avoid safer when ACKs are thinned include/linux/tcp.h | 27 +- include/net/netns/ipv4.h | 2 + include/net/tcp.h | 198 +++++++++++-- include/uapi/linux/tcp.h | 7 + net/ipv4/syncookies.c | 3 + net/ipv4/sysctl_net_ipv4.c | 19 ++ net/ipv4/tcp.c | 26 +- net/ipv4/tcp_input.c | 591 +++++++++++++++++++++++++++++++++++-- net/ipv4/tcp_ipv4.c | 5 +- net/ipv4/tcp_minisocks.c | 92 +++++- net/ipv4/tcp_output.c | 302 +++++++++++++++++-- net/ipv6/syncookies.c | 1 + net/ipv6/tcp_ipv6.c | 1 + 13 files changed, 1178 insertions(+), 96 deletions(-)