diff mbox series

[RFC] cyclictest: Add skip_sec option

Message ID 20250520035641.533706-1-leobras@redhat.com
State New
Headers show
Series [RFC] cyclictest: Add skip_sec option | expand

Commit Message

Leonardo Bras Soares Passos May 20, 2025, 3:56 a.m. UTC
When running cyclictest with break trace (-b) option, wait for skip_sec
seconds before issuing the break.

Cyclictest may present high latency on the initial cycles, which can be
caused by initial resource allocations, and may not represent the
expected latency for the running workload.

So add skip_sec so the user can make cyclictest ignore any break that may
happen in those initial seconds.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---

This behavior was found by Sana and me when we were testing cyclictest
on an ARM KVM VM, while running trace-cmd at host and guest
(with trace-cmd agent).

 src/cyclictest/cyclictest.c | 11 ++++++++++-
 src/cyclictest/cyclictest.8 |  2 ++
 2 files changed, 12 insertions(+), 1 deletion(-)
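
For illustration, the skip window translates into whole cycles the way the
patch below computes it (skip_cycles = skip_sec * USEC_PER_SEC / interval).
A quick arithmetic sketch with hypothetical values, -i 200 and --skip_sec 1:

```shell
# Hypothetical values: 200us interval (-i 200) and --skip_sec 1
skip_sec=1
interval_us=200
# Mirrors the patch: skip_cycles = skip_sec * USEC_PER_SEC / par->interval
skip_cycles=$(( skip_sec * 1000000 / interval_us ))
echo "$skip_cycles"
```

So with these values the first 5000 cycles would be exempt from the -b break.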

Comments

Leonardo Bras Soares Passos May 20, 2025, 2:30 p.m. UTC | #1
On Tue, May 20, 2025 at 4:35 AM John Ogness <john.ogness@linutronix.de> wrote:
>
> On 2025-05-20, Leonardo Bras <leobras@redhat.com> wrote:
> > When running cyclictest with break trace (-b) option, wait for skip_sec
> > seconds before issuing the break.
> >
> > Cyclictest may present high latency on the initial cycles, which can be
> > caused by initial resource allocations, and may not represent the
> > expected latency for the running workload.
>
> If cyclictest is programmed correctly, this will not happen. Can you
> provide a technical explanation for high latencies on initial cycles?

Hi John,

We are currently investigating the source of this latency, but I heard
from other team members that it's happening on Intel Secure VMs as well.

Scenario:
Host: ARM64, kernel-rt, 120 out of 128 cpu isolated, hugepages for guest
KVM Guest: kernel-rt, 120 vcpus pinned on above isolated cpus, 116
vcpus isolated on guest

Cyclictest runs with trace-cmd attaching to a guest agent:

trace-cmd record --poll -m 1000 -e csd -e sched/sched_switch -e timer
-e irq_handler_entry -e irq_handler_exit -e tick_stop -e
ipi_send_cpumask -e ipi_send_cpu -A 3 -e csd -e sched/sched_switch -e
timer -e irq_handler_entry -e irq_handler_exit -e tick_stop -e
ipi_send_cpumask -e ipi_send_cpu bash -c "ssh $my_guest 'cyclictest -m
-q -p95 --policy=fifo -D 2h -i 200 -h60 -t 116 -a 4-119 -b 50
--mainaffinity 0,1,2,3 --tracemark'"

What we see is a peak of >50us in the first couple of cycles, and then a
max peak of 15us in our tests.
We also tested a sleep before the above cyclictest command, in order
to remove any trace-cmd setup that might have been in progress, but it
did not help.

If you have any idea on why this might happen, or suggest a debugging
setup, it would be appreciated.

Thanks!
Leo
John Ogness May 21, 2025, 8:40 a.m. UTC | #2
Hi Leo,

On 2025-05-20, Leonardo Bras Soares Passos <leobras@redhat.com> wrote:
> On Tue, May 20, 2025 at 4:35 AM John Ogness <john.ogness@linutronix.de> wrote:
>>
>> On 2025-05-20, Leonardo Bras <leobras@redhat.com> wrote:
>> > When running cyclictest with break trace (-b) option, wait for skip_sec
>> > seconds before issuing the break.
>> >
>> > Cyclictest may present high latency on the initial cycles, which can be
>> > caused by initial resource allocations, and may not represent the
>> > expected latency for the running workload.
>>
>> If cyclictest is programmed correctly, this will not happen. Can you
>> provide a technical explanation for high latencies on initial cycles?
>
> We are currently investigating the source of this latency, but I heard
> from other team members it's also happening on Intel Secure VMs as
> well.
>
> Scenario:
> Host: ARM64, kernel-rt, 120 out of 128 cpu isolated, hugepages for guest
> KVM Guest: kernel-rt, 120 vcpus pinned on above isolated cpus, 116
> vcpus isolated on guest
>
> Cyclictest runs with trace-cmd attaching to a guest agent:
>
> trace-cmd record --poll -m 1000 -e csd -e sched/sched_switch -e timer
> -e irq_handler_entry -e irq_handler_exit -e tick_stop -e
> ipi_send_cpumask -e ipi_send_cpu -A 3 -e csd -e sched/sched_switch -e
> timer -e irq_handler_entry -e irq_handler_exit -e tick_stop -e
> ipi_send_cpumask -e ipi_send_cpu bash -c "ssh $my_guest 'cyclictest -m
> -q -p95 --policy=fifo -D 2h -i 200 -h60 -t 116 -a 4-119 -b 50
> --mainaffinity 0,1,2,3 --tracemark'"
>
> What we see is a peak of >50us in the first couple cycles, and then a
> max peak of 15 in our tests.

I wonder if this is related to cache misses and/or memory bandwidth. If you
keep this patch but you increase the interval, do you see >50us during
your tests? For example:

    -i 10000

If so, your "max peak" of 15us is only valid when cache hot.

IMHO whatever situation is happening for the "first couple cycles" could
happen later. In that case, the patch is just sweeping some of the bad
numbers under the rug. It is really important to understand exactly what
the problem is before changing cyclictest to ignore such problems.
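
(A side observation, not part of the thread: besides making each wakeup more
likely cache-cold, the suggested interval also shrinks the sample count over
the same 2h run. Plain arithmetic, values taken from the command above:)

```shell
duration_s=$(( 2 * 3600 ))          # -D 2h
for interval_us in 200 10000; do    # original -i vs suggested -i
	echo "$(( duration_s * 1000000 / interval_us )) cycles at ${interval_us}us"
done
```

That is 36000000 cycles at -i 200 versus 720000 at -i 10000, so a longer run
may be needed to keep comparable statistics.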

John
Leonardo Bras Soares Passos May 22, 2025, 4:01 a.m. UTC | #3
On Wed, May 21, 2025 at 5:40 AM John Ogness <john.ogness@linutronix.de> wrote:
>
> Hi Leo,
>
> On 2025-05-20, Leonardo Bras Soares Passos <leobras@redhat.com> wrote:
> > On Tue, May 20, 2025 at 4:35 AM John Ogness <john.ogness@linutronix.de> wrote:
> >>
> >> On 2025-05-20, Leonardo Bras <leobras@redhat.com> wrote:
> >> > When running cyclictest with break trace (-b) option, wait for skip_sec
> >> > seconds before issuing the break.
> >> >
> >> > Cyclictest may present high latency on the initial cycles, which can be
> >> > caused by initial resource allocations, and may not represent the
> >> > expected latency for the running workload.
> >>
> >> If cyclictest is programmed correctly, this will not happen. Can you
> >> provide a technical explanation for high latencies on initial cycles?
> >
> > We are currently investigating the source of this latency, but I heard
> > from other team members it's also happening on Intel Secure VMs as
> > well.
> >
> > Scenario:
> > Host: ARM64, kernel-rt, 120 out of 128 cpu isolated, hugepages for guest
> > KVM Guest: kernel-rt, 120 vcpus pinned on above isolated cpus, 116
> > vcpus isolated on guest
> >
> > Cyclictest runs with trace-cmd attaching to a guest agent:
> >
> > trace-cmd record --poll -m 1000 -e csd -e sched/sched_switch -e timer
> > -e irq_handler_entry -e irq_handler_exit -e tick_stop -e
> > ipi_send_cpumask -e ipi_send_cpu -A 3 -e csd -e sched/sched_switch -e
> > timer -e irq_handler_entry -e irq_handler_exit -e tick_stop -e
> > ipi_send_cpumask -e ipi_send_cpu bash -c "ssh $my_guest 'cyclictest -m
> > -q -p95 --policy=fifo -D 2h -i 200 -h60 -t 116 -a 4-119 -b 50
> > --mainaffinity 0,1,2,3 --tracemark'"
> >
> > What we see is a peak of >50us in the first couple cycles, and then a
> > max peak of 15 in our tests.
>
> I wonder if this related to cache misses and/or memory bandwidth. If you
> keep this patch but you increase the interval, do you see >50us during
> your tests? For example:
>
>     -i 10000
>
> If so, your "max peak" of 15us is only valid when cache hot.
>

That's a smart move; we will run this test as suggested and bring back the results.

> IMHO whatever situation is happening for the "first couple cycles" could
> happen later. In that case, the patch is just sweeping some of the bad
> numbers under the rug. It is really important to understand exactly what
> the problem is before changing cyclictest to ignore such problems.

That's a valid point.
We are currently investigating it, and I will bring the results when
we have some.

I have a v2 ready, but I think it makes sense to send it only once the
above investigation shows whether it makes sense to proceed with the
suggestion.

Thanks!
Leo

Patch

diff --git a/src/cyclictest/cyclictest.c b/src/cyclictest/cyclictest.c
index 890da5d..26e181a 100644
--- a/src/cyclictest/cyclictest.c
+++ b/src/cyclictest/cyclictest.c
@@ -183,20 +183,21 @@  static int use_nsecs = 0;
 static int refresh_on_max;
 static int force_sched_other;
 static int priospread = 0;
 static int check_clock_resolution;
 static int ct_debug;
 static int use_fifo = 0;
 static pthread_t fifo_threadid;
 static int laptop = 0;
 static int power_management = 0;
 static int use_histfile = 0;
+static int skip_sec = 0;
 
 #ifdef ARCH_HAS_SMI_COUNTER
 static int smi = 0;
 #else
 #define smi	0
 #endif
 
 static pthread_cond_t refresh_on_max_cond = PTHREAD_COND_INITIALIZER;
 static pthread_mutex_t refresh_on_max_lock = PTHREAD_MUTEX_INITIALIZER;
 
@@ -637,20 +638,21 @@  static void *timerthread(void *param)
 	sigset_t sigset;
 	timer_t timer;
 	struct timespec now, next, interval, stop = { 0 };
 	struct itimerval itimer;
 	struct itimerspec tspec;
 	struct thread_stat *stat = par->stats;
 	int stopped = 0;
 	cpu_set_t mask;
 	pthread_t thread;
 	unsigned long smi_now, smi_old = 0;
+	unsigned long skip_cycles = skip_sec * USEC_PER_SEC / par->interval;
 
 	/* if we're running in numa mode, set our memory node */
 	if (par->node != -1)
 		rt_numa_set_numa_run_on_node(par->node, par->cpu);
 
 	if (par->cpu != -1) {
 		CPU_ZERO(&mask);
 		CPU_SET(par->cpu, &mask);
 		thread = pthread_self();
 		if (pthread_setaffinity_np(thread, sizeof(mask), &mask) != 0)
@@ -837,21 +839,21 @@  static void *timerthread(void *param)
 				pthread_cond_signal(&refresh_on_max_cond);
 		}
 		stat->avg += (double) diff;
 
 		if (trigger && (diff > trigger))
 			trigger_update(par, diff, calctime(now));
 
 		if (duration && (calcdiff(now, stop) >= 0))
 			shutdown++;
 
-		if (!stopped && tracelimit && (diff > tracelimit)) {
+		if (!stopped && tracelimit && (diff > tracelimit) && stat->cycles > skip_cycles) {
 			stopped++;
 			shutdown++;
 			pthread_mutex_lock(&break_thread_id_lock);
 			if (break_thread_id == 0) {
 				break_thread_id = stat->tid;
 				tracemark("hit latency threshold (%llu > %d)",
 					  (unsigned long long) diff, tracelimit);
 				break_thread_value = diff;
 			}
 			pthread_mutex_unlock(&break_thread_id_lock);
@@ -1081,20 +1083,21 @@  enum option_values {
 	OPT_DEFAULT_SYSTEM, OPT_DISTANCE, OPT_DURATION, OPT_LATENCY,
 	OPT_FIFO, OPT_HISTOGRAM, OPT_HISTOFALL, OPT_HISTFILE,
 	OPT_INTERVAL, OPT_JSON, OPT_MAINAFFINITY, OPT_LOOPS, OPT_MLOCKALL,
 	OPT_REFRESH, OPT_NANOSLEEP, OPT_NSECS, OPT_OSCOPE, OPT_PRIORITY,
 	OPT_QUIET, OPT_PRIOSPREAD, OPT_RELATIVE, OPT_RESOLUTION,
 	OPT_SYSTEM, OPT_SMP, OPT_THREADS, OPT_TRIGGER,
 	OPT_TRIGGER_NODES, OPT_UNBUFFERED, OPT_NUMA, OPT_VERBOSE,
 	OPT_DBGCYCLIC, OPT_POLICY, OPT_HELP, OPT_NUMOPTS,
 	OPT_ALIGNED, OPT_SECALIGNED, OPT_LAPTOP, OPT_SMI,
 	OPT_TRACEMARK, OPT_POSIX_TIMERS, OPT_DEEPEST_IDLE_STATE,
+	OPT_SKIP_SEC
 };
 
 /* Process commandline options */
 static void process_options(int argc, char *argv[], int max_cpus)
 {
 	int error = 0;
 	int option_affinity = 0;
 
 	for (;;) {
 		int option_index = 0;
@@ -1137,20 +1140,21 @@  static void process_options(int argc, char *argv[], int max_cpus)
 			{"spike-nodes",	     required_argument, NULL, OPT_TRIGGER_NODES },
 			{"threads",          optional_argument, NULL, OPT_THREADS },
 			{"tracemark",	     no_argument,	NULL, OPT_TRACEMARK },
 			{"unbuffered",       no_argument,       NULL, OPT_UNBUFFERED },
 			{"verbose",          no_argument,       NULL, OPT_VERBOSE },
 			{"dbg_cyclictest",   no_argument,       NULL, OPT_DBGCYCLIC },
 			{"policy",           required_argument, NULL, OPT_POLICY },
 			{"help",             no_argument,       NULL, OPT_HELP },
 			{"posix_timers",     no_argument,	NULL, OPT_POSIX_TIMERS },
 			{"deepest-idle-state", required_argument,	NULL, OPT_DEEPEST_IDLE_STATE },
+			{"skip_sec",         required_argument,	NULL, OPT_SKIP_SEC },
 			{NULL, 0, NULL, 0 },
 		};
 		int c = getopt_long(argc, argv, "a::A::b:c:d:D:F:h:H:i:l:MNo:p:mqrRsSt::uvD:x",
 				    long_options, &option_index);
 		if (c == -1)
 			break;
 		switch (c) {
 		case 'a':
 		case OPT_AFFINITY:
 			option_affinity = 1;
@@ -1340,20 +1344,25 @@  static void process_options(int argc, char *argv[], int max_cpus)
 			smi = 1;
 #else
 			fatal("--smi is not available on your arch\n");
 #endif
 			break;
 		case OPT_TRACEMARK:
 			trace_marker = 1; break;
 		case OPT_DEEPEST_IDLE_STATE:
 			deepest_idle_state = atoi(optarg);
 			break;
+		case OPT_SKIP_SEC:
+			skip_sec = atoi(optarg);
+			if (skip_sec < 0)
+				skip_sec = 0;
+			break;
 		}
 	}
 
 	if ((use_system == MODE_SYS_OFFSET) && (use_nanosleep == MODE_CYCLIC)) {
 		warn("The system option requires clock_nanosleep\n");
 		warn("and is not compatible with posix_timers\n");
 		warn("Using clock_nanosleep\n");
 		use_nanosleep = MODE_CLOCK_NANOSLEEP;
 	}
 
diff --git a/src/cyclictest/cyclictest.8 b/src/cyclictest/cyclictest.8
index 6becd19..a0eaf7e 100644
--- a/src/cyclictest/cyclictest.8
+++ b/src/cyclictest/cyclictest.8
@@ -174,20 +174,22 @@  Output values on stdout for statistics. This option is used to gather statistica
 
 n:c:v
 
 where n=task number c=count v=latency value in us.
 .TP
 .B \-\-dbg_cyclictest
 Print info useful for debugging cyclictest
 .TP
 .B \-x, \-\-posix_timers
 Use POSIX timers instead of clock_nanosleep.
+.B \-\-skip_sec=SEC
+When using -b, skip sending break trace command for the first SEC seconds.
 
 .SH SEE ALSO
 .BR numa (3),
 .BR numactl (8),
 .\" .br
 .\" The programs are documented fully by
 .\" .IR "The Rise and Fall of a Fooish Bar" ,
 .\" available via the Info system.
 .SH AUTHOR
 cyclictest was written by Thomas Gleixner <tglx@linuxtronix.de>.