Message ID | 20170421131134.27992-9-petri.savolainen@linaro.org |
---|---|
State | Superseded |
Headers | show |
Series | Use HW time counter | expand |
On 04/26 07:11:57, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > > From coverletter: > > > "This patch set modifies time implementation to use TSC when running on > > a x86 > > > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system > > time > > > is used as before. TSC is much more efficient both in performance and > > > latency/jitter wise than Linux system call. This can be seen also with > > > scheduler latency test which time stamps events with this API. All > > latency > > > measurements (min, ave, max) improved significantly." > > > > odp_sched_latency currently uses clock_gettime. It is my understanding > > that clock_gettime does not have the over head of the system call. Can > > you elaborate more on the 'improved significantly' part? > > > > clock_gettime() uses the same TSC, but when you profile it with perf you can see tens of kernel functions including system call entry, RCU maintenance, etc. > > E.g. in sched_latency test kernel consumed about 10% of all the cycles. Also latency measurement results improved like this: > * min >3x lower > * avg 2x lower > * max more stable and 50% lower You might want to share more information on the environment where you're seeing such significant improvements because the results on Broadwell do not match the above interpretation. PS - This patch series breaks the build on ARM. before / after numbers on 2650v4 (HT disabled, Linux 4.4.6, GCC 5.3.1): before: HIGH priority Thread Avg[ns] Min[ns] Max[ns] Samples Total --------------------------------------------------------------- 1 619 351 2463 2103 2103 2 652 382 1509 2019 2019 3 637 360 1950 1867 1867 4 606 373 2328 2073 2073 5 611 371 2677 2096 2096 6 643 378 3045 2106 2106 7 631 354 1677 1923 1923 8 603 367 4721 2054 2054 9 617 373 1524 2111 2111 10 641 369 1808 2024 2024 --------------------------------------------------------------- Total 626 351 4721 20376 20376 LOW priority Thread Avg[ns] Min[ns] Max[ns] Samples Total --------------------------------------------------------------- 1 30498 480 914522 2097 4192201 2 44302 491 584995 1980 4192285 3 84258 680 515286 2001 4192437 4 127714 746 473280 2011 4192231 5 24568 455 724637 2109 4192208 6 42436 473 523936 2041 4192198 7 85438 554 486851 2017 4192381 8 126164 841 203464 2058 4192250 9 23085 492 671478 2091 4192192 10 41748 499 515091 1970 4192280 --------------------------------------------------------------- Total 62725 455 914522 20375 41922663 after: HIGH priority Thread Avg[ns] Min[ns] Max[ns] Samples Total --------------------------------------------------------------- 1 523 202 4671 2152 2152 2 551 276 1540 2058 2058 3 492 257 1274 1928 1928 4 496 269 1201 2035 2035 5 520 252 1506 2165 2165 6 548 291 1540 2002 2002 7 491 251 1274 1969 1969 8 486 259 3007 1951 1951 9 528 276 1601 2091 2091 10 555 264 1611 2001 2001 --------------------------------------------------------------- Total 519 202 4671 20352 20352 LOW priority Thread Avg[ns] Min[ns] Max[ns] Samples Total --------------------------------------------------------------- 1 28303 432 828632 2141 4192152 2 43635 373 662999 2005 4192246 3 85032 579 664442 1991 4192376 4 128053 471 306203 2066 4192269 5 25289 431 591153 2139 4192139 6 41118 362 693192 2013 4192302 7 87817 523 696300 2090 4192335 8 128484 398 232439 2008 4192353 9 23565 507 716741 1969 4192212 10 41952 338 614098 1929 4192303 --------------------------------------------------------------- Total 63273 338 828632 20351 41922687 > -Petri >
On 04/26 07:30:15, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > > This function (cpu_global_time()) is called only when we have first > > checked that TSC is invariant. Also we measure the TSC frequency in that > > case. This function is defined in the same file as cpu_cycles(), and the > > file is x86 specific. So, we know what we are doing, and just re-using the > > code to read TSC. > > > > What sort of timing accuracy is expected from the app? > > > > From benchmarking the maximum single-threaded rate of these reads: > > > > x86_64: > > > > read 7 ns/op > > read_sync 22 ns/op > > > > A57: > > > > read 4 ns/op > > read_sync 26 ns/op > > > > read_sync issues a synchronizing instruction for greater timing accuracy > > but clearly takes more time to return the time value read from the core. > > Accuracy is as good as implementation can offer with reasonable overhead. We do not put any nsec figures into API spec. ODP API should offer application the most efficient way to read time anyway. 'reasonable' is what we need to define. Another reason why you're seeing a performance boost on x86 is that when switching from clock_gettime() to RDTSC, you're no longer issuing a synchronizing instruction (fence). As shown above, this can be a significant factor depending on how often the time is being sampled. However, there is a loss in timing accuracy because the load of the value may not happen at the time it happens in program order. This is why a synchronizing instruction needs to be used, but it slows down the execution of the thread on the core... > This patch does not take a position which way TSC should be read. There are three options: rdtsc, rdtsc + barrier, rdtscp. I think the current code is good enough for the accuracy. Barrier adds slight overhead. Rdtscp is not as widely supported as rdtsc. This detail is a magnitude less significant compared to: use system call vs direct TSC read. It can be tuned later. This patch set helps if rdtscp should be used later on (introduces x86 cpu flags). So you're saying that you do not need the synchronizing instruction, and the loss of timing accuracy is OK, right? > -Petri >
On 04/26 08:02:09, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > > > > diff --git a/platform/linux- > > generic/include/odp/api/plat/time_types.h > > > > b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > index 4847f3b1..8ae7ce82 100644 > > > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > @@ -26,11 +26,28 @@ extern "C" { > > > > > * the linux timespec structure, which is dependent on POSIX > > extension > > > > level. > > > > > */ > > > > > typedef struct odp_time_t { > > > > > - int64_t tv_sec; /**< @internal Seconds */ > > > > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > > > > + union { > > > > > + /** @internal Posix timespec */ > > > > > + struct { > > > > > + /** @internal Seconds */ > > > > > + int64_t tv_sec; > > > > > + > > > > > + /** @internal Nanoseconds */ > > > > > + int64_t tv_nsec; > > > > > + } spec; > > > > > + > > > > > + /** @internal HW time counter */ > > > > > + struct { > > > > > + /** @internal Counter value */ > > > > > + uint64_t count; > > > > > + > > > > > + /** @internal Reserved */ > > > > > + uint64_t res; > > > > > + } hw; > > > > > + }; > > > > > > > > A processor's tick counter cannot be used to measure calendar time by > > > > itself. The delta between two ticks, however, can be converted to > > > > calendar time. > > > > > > > > Please see the proposal that introduces odp_tick_t which eliminates > > > > the wasted u64 reserved field in the union above. > > > > > > > > > Nothing is wasted here today. > > > > The 64-bit CPU time is stored in a 128-bit data structure where 64-bits > > are wasted. You can use just a 64-bit type and then convert that to > > a timespec (using a starting timestamp taken on global init) if needed. > > > > Nothing is wasted compared to the situation today. Entire timespec is stored as before. If you want to compress timespec, it's another topic. Compress means additional complexity and overhead. This patch set just adds ability to use 64 but HW time when available. Timespec side implementation remains as is. You are re-using the calendar time type to store a value read from the CPU counter. A different approach is to use a different type (64 bits only for no wasted space) for the value read from the CPU. This value must be converted to a calendar time time if needed. It cannot just be used to represent calendar time. However, for some applications and use cases, you don't need to convert to calendar time in order to measure across two reads of the counter. You can also do direct arithmetic on the value instead of arithmetic on a timespec. If you want to read code instead of email or design documentation, look at this: https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_bench.c https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time.h All you need is a 64 bit type, a way to read from the counter as well as get the frequency, and 2 very small conversion functions. > > > This is because all our code (in this and other files) does "static > > inline" for functions that we want to be inlined. Static is used for > > functions that are static, but we don't care if those are inlined (slow > > path). Most time API functions are fast path. > > > > I am saying that adding the 'inline' qualifier here doesn't actually do > > anything > > because the compiler will consider static functions for inlining anyways. > > > We all know that those are considered. We use "inline" as C standard describes: to suggest which function calls should be as fast as possible. > > > > > > > + > > > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > > > > > +{ > > > > > + odp_time_t time; > > > > > + > > > > > + time.hw.res = 0; > > > > > + time.hw.count = t1.hw.count + t2.hw.count; > > > > > + > > > > > + return time; > > > > > +} > > > > > > > > If a single u64 were used this code wouldn't need to exist. > > > > > > Which code? hw.res = 0 ? It not actually used for anything and could be > > removed. Although, the performance gain would not be huge. Anyway, I'll > > remove it for v2. > > > > If the CPU time was stored in a 64-bit type, you can use arithmetic > > operators > > on the values directly instead of doing arithmetic on a 128-bit compound > > datatype where 64-bits are unused. This is the union above. > > > The implementation above does 64bit arithmetic. The reserved field is not used for anything. V2 removed all .res references (except one due to a false GCC warning). We have abstract API definition, so that application remains portable also, when time cannot be represented with a single integer type. > > It's implementation defined, if e.g. the above sum function is inlined and thus results no overhead at all. > > -Petri > >
On 04/27 10:02:24, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > -----Original Message----- > > From: Brian Brooks [mailto:brian.brooks@arm.com] > > Sent: Wednesday, April 19, 2017 10:05 PM > > To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com> > > Cc: lng-odp@lists.linaro.org > > Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time > > counter when available > > > > On 04/26 08:02:09, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > > > > > > > > > > diff --git a/platform/linux- > > > > generic/include/odp/api/plat/time_types.h > > > > > > b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > > > index 4847f3b1..8ae7ce82 100644 > > > > > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > > > @@ -26,11 +26,28 @@ extern "C" { > > > > > > > * the linux timespec structure, which is dependent on POSIX > > > > extension > > > > > > level. > > > > > > > */ > > > > > > > typedef struct odp_time_t { > > > > > > > - int64_t tv_sec; /**< @internal Seconds */ > > > > > > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > > > > > > + union { > > > > > > > + /** @internal Posix timespec */ > > > > > > > + struct { > > > > > > > + /** @internal Seconds */ > > > > > > > + int64_t tv_sec; > > > > > > > + > > > > > > > + /** @internal Nanoseconds */ > > > > > > > + int64_t tv_nsec; > > > > > > > + } spec; > > > > > > > + > > > > > > > + /** @internal HW time counter */ > > > > > > > + struct { > > > > > > > + /** @internal Counter value */ > > > > > > > + uint64_t count; > > > > > > > + > > > > > > > + /** @internal Reserved */ > > > > > > > + uint64_t res; > > > > > > > + } hw; > > > > > > > + }; > > > > > > > > > > > > A processor's tick counter cannot be used to measure calendar time > > by > > > > > > itself. The delta between two ticks, however, can be converted to > > > > > > calendar time. > > > > > > > > > > > > Please see the proposal that introduces odp_tick_t which > > eliminates > > > > > > the wasted u64 reserved field in the union above. > > > > > > > > > > > > > > > Nothing is wasted here today. > > > > > > > > The 64-bit CPU time is stored in a 128-bit data structure where 64- > > bits > > > > are wasted. You can use just a 64-bit type and then convert that to > > > > a timespec (using a starting timestamp taken on global init) if > > needed. > > > > > > > > > > Nothing is wasted compared to the situation today. Entire timespec is > > stored as before. If you want to compress timespec, it's another topic. > > Compress means additional complexity and overhead. This patch set just > > adds ability to use 64 but HW time when available. Timespec side > > implementation remains as is. > > > > You are re-using the calendar time type to store a value read from the CPU > > counter. A different approach is to use a different type (64 bits only for > > no > > wasted space) for the value read from the CPU. > > > > This value must be converted to a calendar time time if needed. It cannot > > just > > be used to represent calendar time. > > > > However, for some applications and use cases, you don't need to convert to > > calendar time > > in order to measure across two reads of the counter. You can also do > > direct arithmetic > > on the value instead of arithmetic on a timespec. > > > > If you want to read code instead of email or design documentation, look at > > this: > > > > > > https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_ben > > ch.c > > https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time.h > > > > All you need is a 64 bit type, a way to read from the counter as well as > > get > > the frequency, and 2 very small conversion functions. > > > First, odp_time_t is our storage type for time values. Today odp-linux defines that as 2x64bits. Nobody has complained about too high time storage consumption. I'm not changing size of the type here. It's not purpose of this patch set to e.g. compress timespec and thus minimize memory foot print of time storage. odp_time_t represents a calendar time. You can "get" an odp_time_t and tell what calendar time it is because its value represents a defined offset from some point in calendar time. You cannot do the same with the value read from the CPU counter because there is no defined offset from a point in calendar time. The HW does not guarantee this. That is why I have proposed a separate odp_tick_t type. This type need also only be 64 bits. Wasting an additional 64 bits (to store a tick in a odp_time_t) is costly when you have many in- flight events in the system. > Second, are you proposing an API change? Change of time type definition? Again, it is documented here: https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-eL0oGLAQ4OM/edit#heading=h.2a3mt9nt0qsp > Ola's cntr_now() function changes as soon as your time source is not a 64bit counter, but e.g. timespec (2x64bit). It is not Ola's code, but if it changes the perspective in a positive way it can be. "as soon as" can be quantified as not ARMv8 and not x64_64. I think that is the exception rather than the norm. In this case, a struct timespec can be converted into nanosecond representation and stored in a 64 bit value. That is the way to design in favor of ARMv8 and x86_64; no space waste and no expensive conversion into a timespec. Of course, this means that application code has to use a new type: odp_tick_t. > It would need to compress that. How application would do arithmetic on the compressed timespec ? You multiply seconds by 1,000,000,000 and add the nanoseconds to get the epoch offset in terms of nanoseconds - this can be stored in a uint64_t. And, now you can "spec" that time is in terms of nanoseconds not "implementation defined". The application can now do arithmetic too. > Third, compression/conversion usually mean division operation. Division is tens of times heavier operation than sum, dec, mult. Currently, ODP API is defined so that conversion is done only when application asks for it (odp_time_t <-> nsec time). Compression is a trade-off between storage space size and performance. I prefer to measure. Perhaps time operations using CPU tick masked by a timespec should be added here: https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_bench.c Taking the difference between two 64-bit values and calling nsec_from_cntr() is quite fast - as fast (faster on ARM) as calling clock_gettime() once through the vDSO. > -Petri > >
On 04/27 08:21:34, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > -----Original Message----- > > From: Brian Brooks [mailto:brian.brooks@arm.com] > > Sent: Wednesday, April 19, 2017 9:46 PM > > To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com> > > Cc: lng-odp@lists.linaro.org > > Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time > > counter when available > > > > On 04/26 07:30:15, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > > > > > > This function (cpu_global_time()) is called only when we have > > first > > > > checked that TSC is invariant. Also we measure the TSC frequency in > > that > > > > case. This function is defined in the same file as cpu_cycles(), and > > the > > > > file is x86 specific. So, we know what we are doing, and just re-using > > the > > > > code to read TSC. > > > > > > > > What sort of timing accuracy is expected from the app? > > > > > > > > From benchmarking the maximum single-threaded rate of these reads: > > > > > > > > x86_64: > > > > > > > > read 7 ns/op > > > > read_sync 22 ns/op > > > > > > > > A57: > > > > > > > > read 4 ns/op > > > > read_sync 26 ns/op > > > > > > > > read_sync issues a synchronizing instruction for greater timing > > accuracy > > > > but clearly takes more time to return the time value read from the > > core. > > > > > > Accuracy is as good as implementation can offer with reasonable > > overhead. We do not put any nsec figures into API spec. ODP API should > > offer application the most efficient way to read time anyway. > > > > 'reasonable' is what we need to define. > > > > Another reason why you're seeing a performance boost on x86 is that when > > switching from clock_gettime() to RDTSC, you're no longer issuing a > > synchronizing > > instruction (fence). As shown above, this can be a significant factor > > depending > > on how often the time is being sampled. > > > > However, there is a loss in timing accuracy because the load of the value > > may not happen at the time it happens in program order. This is why a > > synchronizing instruction needs to be used, but it slows down the > > execution > > of the thread on the core... > > > > > This patch does not take a position which way TSC should be read. There > > are three options: rdtsc, rdtsc + barrier, rdtscp. I think the current > > code is good enough for the accuracy. Barrier adds slight overhead. Rdtscp > > is not as widely supported as rdtsc. This detail is a magnitude less > > significant compared to: use system call vs direct TSC read. It can be > > tuned later. This patch set helps if rdtscp should be used later on > > (introduces x86 cpu flags). > > > > So you're saying that you do not need the synchronizing instruction, and > > the > > loss of timing accuracy is OK, right? > > > What is our timing accuracy today? How much jitter the system call (and everything may launch) causes in the current implementation? How much we are losing accuracy compared what we have today? I'd say we are better off that today anyway, just because we avoid the system call. > > We don't have fence today on rdtsc read for cycle count. This patch does not add it, either. If we are good without it for cycles, we are good for nsec also. > > The scale time scale application typically measures is in order of micro to milliseconds - thousands to millions of nanoseconds. If out of order execution moves the sample position e.g. 20 nsec (40 CPU cycles), it is not significant error to the measurement. Already single cache miss during a measurement causes same kind of noise to the measurement. Also CPU crystal might not be too accurate either, etc. Thanks for confirming that from the application point of view. Can also comment as to whether this is the same position for Timer Pool, Timer, and Timeout events? > -Petri > >
On 04/21 16:11:34, Petri Savolainen wrote: > Use 64 bit HW time counter when available. It is used on > x86 when invariant TSC CPU flag indicates that TSC frequency > is constant. Otherwise, the system time is used as before. Direct > HW time counter usage avoids system call, and related latency > and performance issues. > > Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org> > --- > platform/linux-generic/Makefile.am | 1 + > platform/linux-generic/arch/arm/odp_cpu_arch.c | 11 + > platform/linux-generic/arch/default/odp_cpu_arch.c | 11 + > platform/linux-generic/arch/mips64/odp_cpu_arch.c | 11 + > platform/linux-generic/arch/powerpc/odp_cpu_arch.c | 11 + > platform/linux-generic/arch/x86/cpu_flags.c | 9 + > platform/linux-generic/arch/x86/odp_cpu_arch.c | 59 ++++ > .../include/odp/api/plat/time_types.h | 23 +- > platform/linux-generic/include/odp_time_internal.h | 24 ++ > platform/linux-generic/odp_time.c | 303 ++++++++++++++++----- > 10 files changed, 398 insertions(+), 65 deletions(-) > create mode 100644 platform/linux-generic/include/odp_time_internal.h > > diff --git a/platform/linux-generic/Makefile.am b/platform/linux-generic/Makefile.am > index 60b7f849..ed66fecf 100644 > --- a/platform/linux-generic/Makefile.am > +++ b/platform/linux-generic/Makefile.am > @@ -171,6 +171,7 @@ noinst_HEADERS = \ > ${srcdir}/include/odp_schedule_if.h \ > ${srcdir}/include/odp_sorted_list_internal.h \ > ${srcdir}/include/odp_shm_internal.h \ > + ${srcdir}/include/odp_time_internal.h \ > ${srcdir}/include/odp_timer_internal.h \ > ${srcdir}/include/odp_timer_wheel_internal.h \ > ${srcdir}/include/odp_traffic_mngr_internal.h \ > diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c b/platform/linux-generic/arch/arm/odp_cpu_arch.c > index 2ac223e0..3a87f09c 100644 > --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c > @@ -13,6 +13,7 @@ > #include <odp/api/hints.h> > #include <odp/api/system_info.h> > #include <odp_debug_internal.h> > +#include <odp_time_internal.h> > > #define GIGA 1000000000 > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return 0; > +} > + > +uint64_t cpu_global_time_freq(void) > +{ > + return 0; > +} > diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c b/platform/linux-generic/arch/default/odp_cpu_arch.c > index 2ac223e0..3a87f09c 100644 > --- a/platform/linux-generic/arch/default/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c > @@ -13,6 +13,7 @@ > #include <odp/api/hints.h> > #include <odp/api/system_info.h> > #include <odp_debug_internal.h> > +#include <odp_time_internal.h> > > #define GIGA 1000000000 > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return 0; > +} > + > +uint64_t cpu_global_time_freq(void) > +{ > + return 0; > +} > diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c b/platform/linux-generic/arch/mips64/odp_cpu_arch.c > index 646acf9c..a9a94531 100644 > --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c > @@ -7,6 +7,7 @@ > #include <odp/api/cpu.h> > #include <odp/api/hints.h> > #include <odp/api/system_info.h> > +#include <odp_time_internal.h> > > uint64_t odp_cpu_cycles(void) > { > @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return 0; > +} > + > +uint64_t cpu_global_time_freq(void) > +{ > + return 0; > +} > diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > index 2ac223e0..3a87f09c 100644 > --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > @@ -13,6 +13,7 @@ > #include <odp/api/hints.h> > #include <odp/api/system_info.h> > #include <odp_debug_internal.h> > +#include <odp_time_internal.h> > > #define GIGA 1000000000 > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return 0; > +} > + > +uint64_t cpu_global_time_freq(void) > +{ > + return 0; > +} > diff --git a/platform/linux-generic/arch/x86/cpu_flags.c b/platform/linux-generic/arch/x86/cpu_flags.c > index 8fb9477a..cde8ad3e 100644 > --- a/platform/linux-generic/arch/x86/cpu_flags.c > +++ b/platform/linux-generic/arch/x86/cpu_flags.c > @@ -38,6 +38,7 @@ > */ > > #include <arch/x86/cpu_flags.h> > +#include <odp_time_internal.h> > #include <stdio.h> > #include <stdint.h> > > @@ -347,3 +348,11 @@ void cpu_flags_print_all(void) > > printf("\n\n"); > } > + > +int cpu_has_global_time(void) > +{ > + if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0) > + return 1; > + > + return 0; > +} > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c b/platform/linux-generic/arch/x86/odp_cpu_arch.c > index c8cf27b6..9ba601a3 100644 > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c > @@ -3,7 +3,14 @@ > * > * SPDX-License-Identifier: BSD-3-Clause > */ > + > +#include <odp_posix_extensions.h> > + > #include <odp/api/cpu.h> > +#include <odp_time_internal.h> > +#include <odp_debug_internal.h> > + > +#include <time.h> > > uint64_t odp_cpu_cycles(void) > { > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return odp_cpu_cycles(); A cycle counter cannot always be used to measure time. Even on x86, odp_cpu_cycles() will return the value of RDTSC which is not actually representative of the cycle count. Even if the x86 processor is set to a fixed frequency, the Invariant TSC may run at a different fixed frequency. Please take a look at the odp_tick_t proposal here: https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-eL0oGLAQ4OM/edit?usp=sharing > +} > + > +#define SEC_IN_NS 1000000000ULL > + > +/* Measure TSC frequency. Frequency information registers are defined for x86, > + * but those are often not enumerated. */ > +uint64_t cpu_global_time_freq(void) > +{ The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's frequency in kHz. It would be better to use this when running under a hypervisor. It appears to work under VMware as well. > + struct timespec sleep, ts1, ts2; > + uint64_t t1, t2, ts_nsec, cycles, hz; > + int i; > + uint64_t avg = 0; > + int rounds = 4; > + > + for (i = 0; i < rounds; i++) { > + sleep.tv_sec = 0; > + sleep.tv_nsec = SEC_IN_NS / 10; > + > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) { > + ODP_DBG("clock_gettime failed\n"); > + return 0; > + } > + > + t1 = cpu_global_time(); > + > + if (nanosleep(&sleep, NULL) < 0) { > + ODP_DBG("nanosleep failed\n"); > + return 0; > + } > + > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) { > + ODP_DBG("clock_gettime failed\n"); > + return 0; > + } > + > + t2 = cpu_global_time(); > + > + ts_nsec = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS; > + ts_nsec += ts2.tv_nsec - ts1.tv_nsec; > + > + cycles = t2 - t1; > + > + hz = (cycles * SEC_IN_NS) / ts_nsec; > + avg += hz; > + } > + > + return avg / rounds; > +} > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h b/platform/linux-generic/include/odp/api/plat/time_types.h > index 4847f3b1..8ae7ce82 100644 > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > @@ -26,11 +26,28 @@ extern "C" { > * the linux timespec structure, which is dependent on POSIX extension level. > */ > typedef struct odp_time_t { > - int64_t tv_sec; /**< @internal Seconds */ > - int64_t tv_nsec; /**< @internal Nanoseconds */ > + union { > + /** @internal Posix timespec */ > + struct { > + /** @internal Seconds */ > + int64_t tv_sec; > + > + /** @internal Nanoseconds */ > + int64_t tv_nsec; > + } spec; > + > + /** @internal HW time counter */ > + struct { > + /** @internal Counter value */ > + uint64_t count; > + > + /** @internal Reserved */ > + uint64_t res; > + } hw; > + }; A processor's tick counter cannot be used to measure calendar time by itself. The delta between two ticks, however, can be converted to calendar time. Please see the proposal that introduces odp_tick_t which eliminates the wasted u64 reserved field in the union above. > } odp_time_t; > > -#define ODP_TIME_NULL ((odp_time_t){0, 0}) > +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} }) > > /** > * @} > diff --git a/platform/linux-generic/include/odp_time_internal.h b/platform/linux-generic/include/odp_time_internal.h > new file mode 100644 > index 00000000..99ac7977 > --- /dev/null > +++ b/platform/linux-generic/include/odp_time_internal.h > @@ -0,0 +1,24 @@ > +/* Copyright (c) 2017, Linaro Limited > + * All rights reserved. > + * > + * SPDX-License-Identifier: BSD-3-Clause > + */ > + > +#ifndef ODP_TIME_INTERNAL_H_ > +#define ODP_TIME_INTERNAL_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include <stdint.h> > + > +int cpu_has_global_time(void); > +uint64_t cpu_global_time(void); > +uint64_t cpu_global_time_freq(void); > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif > diff --git a/platform/linux-generic/odp_time.c b/platform/linux-generic/odp_time.c > index 81e05224..2dd8f2c4 100644 > --- a/platform/linux-generic/odp_time.c > +++ b/platform/linux-generic/odp_time.c > @@ -10,36 +10,39 @@ > #include <odp/api/time.h> > #include <odp/api/hints.h> > #include <odp_debug_internal.h> > +#include <odp_time_internal.h> > +#include <string.h> > +#include <inttypes.h> > > -static odp_time_t start_time; > +typedef struct time_global_t { > + odp_time_t start_time; > + int use_hw; > + uint64_t hw_start; > + uint64_t hw_freq_hz; > +} time_global_t; > > -static inline > -uint64_t time_to_ns(odp_time_t time) > -{ > - uint64_t ns; > - > - ns = time.tv_sec * ODP_TIME_SEC_IN_NS; > - ns += time.tv_nsec; > +static time_global_t global; > > - return ns; > -} > +/* > + * Posix timespec based functions > + */ > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) Static functions in .c files will always be considered for inlining, so the inline qualifier is unnecessary here. You can use always_inline compiler attribute if the disassembly and run time performance is not as expected. Comment applies to nearly every function in this file. > { > odp_time_t time; > > - time.tv_sec = t2.tv_sec - t1.tv_sec; > - time.tv_nsec = t2.tv_nsec - t1.tv_nsec; > + time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec; > + time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec; > > - if (time.tv_nsec < 0) { > - time.tv_nsec += ODP_TIME_SEC_IN_NS; > - --time.tv_sec; > + if (time.spec.tv_nsec < 0) { > + time.spec.tv_nsec += ODP_TIME_SEC_IN_NS; > + --time.spec.tv_sec; > } > > return time; > } > > -static inline odp_time_t time_local(void) > +static inline odp_time_t time_spec_cur(void) > { > int ret; > odp_time_t time; > @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void) > if (odp_unlikely(ret != 0)) > ODP_ABORT("clock_gettime failed\n"); > > - time.tv_sec = sys_time.tv_sec; > - time.tv_nsec = sys_time.tv_nsec; > + time.spec.tv_sec = sys_time.tv_sec; > + time.spec.tv_nsec = sys_time.tv_nsec; > > - return time_diff(time, start_time); > + return time_spec_diff(time, global.start_time); > } > > -static inline int time_cmp(odp_time_t t2, odp_time_t t1) > +static inline uint64_t time_spec_res(void) > { > - if (t2.tv_sec < t1.tv_sec) > + int ret; > + struct timespec tres; > + > + ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); > + if (odp_unlikely(ret != 0)) > + ODP_ABORT("clock_getres failed\n"); > + > + return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; > +} > + > +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1) > +{ > + if (t2.spec.tv_sec < t1.spec.tv_sec) > return -1; > > - if (t2.tv_sec > t1.tv_sec) > + if (t2.spec.tv_sec > t1.spec.tv_sec) > return 1; > > - return t2.tv_nsec - t1.tv_nsec; > + return t2.spec.tv_nsec - t1.spec.tv_nsec; > } > > -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) > +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2) > { > odp_time_t time; > > - time.tv_sec = t2.tv_sec + t1.tv_sec; > - time.tv_nsec = t2.tv_nsec + t1.tv_nsec; > + time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec; > + time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec; > > - if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { > - time.tv_nsec -= ODP_TIME_SEC_IN_NS; > - ++time.tv_sec; > + if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { > + time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS; > + ++time.spec.tv_sec; > } > > return time; > } > > -static inline odp_time_t time_local_from_ns(uint64_t ns) > +static inline uint64_t time_spec_to_ns(odp_time_t time) > +{ > + uint64_t ns; > + > + ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS; > + ns += time.spec.tv_nsec; > + > + return ns; > +} > + > +static inline odp_time_t time_spec_from_ns(uint64_t ns) > { > odp_time_t time; > > - time.tv_sec = ns / ODP_TIME_SEC_IN_NS; > - time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS; > + time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS; > + time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS; > > return time; > } > > -static inline void time_wait_until(odp_time_t time) > +/* > + * HW time counter based functions > + */ > + > +static inline odp_time_t time_hw_cur(void) > { > - odp_time_t cur; > + odp_time_t time; > > - do { > - cur = time_local(); > - } while (time_cmp(time, cur) > 0); > + time.hw.res = 0; > + time.hw.count = cpu_global_time() - global.hw_start; > + > + return time; > } > > -static inline uint64_t time_local_res(void) > +static inline uint64_t time_hw_res(void) I think 'freq' is a more descriptive name than 'res'. It would also be good to document at either the function declaration or the function definition (if multiple implementations differ) whether the frequency is in Hz or kHz. For example, on ARMv8 it is just a register read to get Hz. No estimation, and no lowering to kHz. > { > - int ret; > - struct timespec tres; > + /* Promise a bit lower resolution than average cycle counter > + * frequency */ > + return global.hw_freq_hz / 10; > +} > > - ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); > - if (odp_unlikely(ret != 0)) > - ODP_ABORT("clock_getres failed\n"); > +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1) > +{ > + if (odp_likely(t2.hw.count > t1.hw.count)) > + return 1; > > - return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; > + if (t2.hw.count < t1.hw.count) > + return -1; > + > + return 0; > +} > + > +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1) > +{ > + odp_time_t time; > + > + time.hw.res = 0; > + time.hw.count = t2.hw.count - t1.hw.count; > + > + return time; > +} > + > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > +{ > + odp_time_t time; > + > + time.hw.res = 0; > + time.hw.count = t1.hw.count + t2.hw.count; > + > + return time; > +} If a single u64 were used this code wouldn't need to exist. > +static inline uint64_t time_hw_to_ns(odp_time_t time) > +{ > + uint64_t nsec; > + uint64_t freq_hz = global.hw_freq_hz; > + uint64_t count = time.hw.count; > + uint64_t sec = 0; > + > + if (count >= freq_hz) { > + sec = count / freq_hz; > + count = count - sec * freq_hz; > + } > + > + nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz; > + > + return (sec * ODP_TIME_SEC_IN_NS) + nsec; > +} > + > +static inline odp_time_t time_hw_from_ns(uint64_t ns) > +{ > + odp_time_t time; > + uint64_t count; > + uint64_t freq_hz = global.hw_freq_hz; > + uint64_t sec = 0; > + > + if (ns >= ODP_TIME_SEC_IN_NS) { > + sec = ns / ODP_TIME_SEC_IN_NS; > + ns = ns - sec * ODP_TIME_SEC_IN_NS; > + } > + > + count = sec * freq_hz; > + count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS; > + > + time.hw.res = 0; > + time.hw.count = count; > + > + return time; > +} > + > +/* > + * Common functions > + */ > + > +static inline odp_time_t time_cur(void) > +{ > + if (global.use_hw) > + return time_hw_cur(); > + > + return time_spec_cur(); > +} > + > +static inline uint64_t time_res(void) > +{ > + if (global.use_hw) > + return time_hw_res(); > + > + return time_spec_res(); > +} > + > +static inline int time_cmp(odp_time_t t2, odp_time_t t1) > +{ > + if (global.use_hw) > + return time_hw_cmp(t2, t1); > + > + return time_spec_cmp(t2, t1); > +} > + > +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > +{ > + if (global.use_hw) > + return time_hw_diff(t2, t1); > + > + return time_spec_diff(t2, t1); > +} > + > +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) > +{ > + if (global.use_hw) > + return time_hw_sum(t1, t2); > + > + return time_spec_sum(t1, t2); > +} > + > +static inline uint64_t time_to_ns(odp_time_t time) > +{ > + if (global.use_hw) > + return time_hw_to_ns(time); > + > + return time_spec_to_ns(time); > +} > + > +static inline odp_time_t time_from_ns(uint64_t ns) > +{ > + if (global.use_hw) > + return time_hw_from_ns(ns); > + > + return time_spec_from_ns(ns); > +} > + > +static inline void time_wait_until(odp_time_t time) > +{ > + odp_time_t cur; > + > + do { > + cur = time_cur(); > + } while (time_cmp(time, cur) > 0); > } > > odp_time_t odp_time_local(void) > { > - return time_local(); > + return time_cur(); > } > > odp_time_t odp_time_global(void) > { > - return time_local(); > + return time_cur(); > } > > odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1) > @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time) > > odp_time_t odp_time_local_from_ns(uint64_t ns) > { > - return time_local_from_ns(ns); > + return time_from_ns(ns); > } > > odp_time_t odp_time_global_from_ns(uint64_t ns) > { > - return time_local_from_ns(ns); > + return time_from_ns(ns); > } > > int odp_time_cmp(odp_time_t t2, odp_time_t t1) > @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t t2) > > uint64_t odp_time_local_res(void) > { > - return time_local_res(); > + return time_res(); > } > > uint64_t odp_time_global_res(void) > { > - return time_local_res(); > + return time_res(); > } > > void odp_time_wait_ns(uint64_t ns) > { > - odp_time_t cur = time_local(); > - odp_time_t wait = time_local_from_ns(ns); > + odp_time_t cur = time_cur(); > + odp_time_t wait = time_from_ns(ns); > odp_time_t end_time = time_sum(cur, wait); > > time_wait_until(end_time); > @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time) > > int odp_time_init_global(void) > { > - int ret; > - struct timespec time; > - > - ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time); > - if (ret) { > - start_time = ODP_TIME_NULL; > - } else { > - start_time.tv_sec = time.tv_sec; > - start_time.tv_nsec = time.tv_nsec; > + struct timespec sys_time; > + int ret = 0; > + > + memset(&global, 0, sizeof(time_global_t)); > + > + if (cpu_has_global_time()) { > + global.use_hw = 1; > + global.hw_freq_hz = cpu_global_time_freq(); > + > + if (global.hw_freq_hz == 0) > + return -1; > + > + printf("HW time counter freq: %" PRIu64 " hz\n\n", > + global.hw_freq_hz); > + > + global.hw_start = cpu_global_time(); > + return 0; > + } > + > + global.start_time = ODP_TIME_NULL; > + > + ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time); > + if (ret == 0) { > + global.start_time.spec.tv_sec = sys_time.tv_sec; > + global.start_time.spec.tv_nsec = sys_time.tv_nsec; > } > > return ret; > -- > 2.11.0 >
Good stuff. I'll put this on the agenda for Monday's ARCH call to discuss. On Fri, Apr 21, 2017 at 12:29 PM, Brian Brooks <brian.brooks@arm.com> wrote: > On 04/21 16:11:34, Petri Savolainen wrote: > > Use 64 bit HW time counter when available. It is used on > > x86 when invariant TSC CPU flag indicates that TSC frequency > > is constant. Otherwise, the system time is used as before. Direct > > HW time counter usage avoids system call, and related latency > > and performance issues. > > > > Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org> > > --- > > platform/linux-generic/Makefile.am | 1 + > > platform/linux-generic/arch/arm/odp_cpu_arch.c | 11 + > > platform/linux-generic/arch/default/odp_cpu_arch.c | 11 + > > platform/linux-generic/arch/mips64/odp_cpu_arch.c | 11 + > > platform/linux-generic/arch/powerpc/odp_cpu_arch.c | 11 + > > platform/linux-generic/arch/x86/cpu_flags.c | 9 + > > platform/linux-generic/arch/x86/odp_cpu_arch.c | 59 ++++ > > .../include/odp/api/plat/time_types.h | 23 +- > > platform/linux-generic/include/odp_time_internal.h | 24 ++ > > platform/linux-generic/odp_time.c | 303 > ++++++++++++++++----- > > 10 files changed, 398 insertions(+), 65 deletions(-) > > create mode 100644 platform/linux-generic/include/odp_time_internal.h > > > > diff --git a/platform/linux-generic/Makefile.am > b/platform/linux-generic/Makefile.am > > index 60b7f849..ed66fecf 100644 > > --- a/platform/linux-generic/Makefile.am > > +++ b/platform/linux-generic/Makefile.am > > @@ -171,6 +171,7 @@ noinst_HEADERS = \ > > ${srcdir}/include/odp_schedule_if.h \ > > ${srcdir}/include/odp_sorted_list_internal.h \ > > ${srcdir}/include/odp_shm_internal.h \ > > + ${srcdir}/include/odp_time_internal.h \ > > ${srcdir}/include/odp_timer_internal.h \ > > ${srcdir}/include/odp_timer_wheel_internal.h \ > > ${srcdir}/include/odp_traffic_mngr_internal.h \ > > diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c > b/platform/linux-generic/arch/arm/odp_cpu_arch.c > > index 2ac223e0..3a87f09c 100644 > > --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c > > +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c > > @@ -13,6 +13,7 @@ > > #include <odp/api/hints.h> > > #include <odp/api/system_info.h> > > #include <odp_debug_internal.h> > > +#include <odp_time_internal.h> > > > > #define GIGA 1000000000 > > > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > > { > > return 1; > > } > > + > > +uint64_t cpu_global_time(void) > > +{ > > + return 0; > > +} > > + > > +uint64_t cpu_global_time_freq(void) > > +{ > > + return 0; > > +} > > diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c > b/platform/linux-generic/arch/default/odp_cpu_arch.c > > index 2ac223e0..3a87f09c 100644 > > --- a/platform/linux-generic/arch/default/odp_cpu_arch.c > > +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c > > @@ -13,6 +13,7 @@ > > #include <odp/api/hints.h> > > #include <odp/api/system_info.h> > > #include <odp_debug_internal.h> > > +#include <odp_time_internal.h> > > > > #define GIGA 1000000000 > > > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > > { > > return 1; > > } > > + > > +uint64_t cpu_global_time(void) > > +{ > > + return 0; > > +} > > + > > +uint64_t cpu_global_time_freq(void) > > +{ > > + return 0; > > +} > > diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c > b/platform/linux-generic/arch/mips64/odp_cpu_arch.c > > index 646acf9c..a9a94531 100644 > > --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c > > +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c > > @@ -7,6 +7,7 @@ > > #include <odp/api/cpu.h> > > #include <odp/api/hints.h> > > #include <odp/api/system_info.h> > > +#include <odp_time_internal.h> > > > > uint64_t odp_cpu_cycles(void) > > { > > @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void) > > { > > return 1; > > } > > + > > +uint64_t cpu_global_time(void) > > +{ > > + return 0; > > +} > > + > > +uint64_t cpu_global_time_freq(void) > > +{ > > + return 0; > > +} > > diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > > index 2ac223e0..3a87f09c 100644 > > --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > > +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > > @@ -13,6 +13,7 @@ > > #include <odp/api/hints.h> > > #include <odp/api/system_info.h> > > #include <odp_debug_internal.h> > > +#include <odp_time_internal.h> > > > > #define GIGA 1000000000 > > > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > > { > > return 1; > > } > > + > > +uint64_t cpu_global_time(void) > > +{ > > + return 0; > > +} > > + > > +uint64_t cpu_global_time_freq(void) > > +{ > > + return 0; > > +} > > diff --git a/platform/linux-generic/arch/x86/cpu_flags.c > b/platform/linux-generic/arch/x86/cpu_flags.c > > index 8fb9477a..cde8ad3e 100644 > > --- a/platform/linux-generic/arch/x86/cpu_flags.c > > +++ b/platform/linux-generic/arch/x86/cpu_flags.c > > @@ -38,6 +38,7 @@ > > */ > > > > #include <arch/x86/cpu_flags.h> > > +#include <odp_time_internal.h> > > #include <stdio.h> > > #include <stdint.h> > > > > @@ -347,3 +348,11 @@ void cpu_flags_print_all(void) > > > > printf("\n\n"); > > } > > + > > +int cpu_has_global_time(void) > > +{ > > + if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0) > > + return 1; > > + > > + return 0; > > +} > > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c > b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > index c8cf27b6..9ba601a3 100644 > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > @@ -3,7 +3,14 @@ > > * > > * SPDX-License-Identifier: BSD-3-Clause > > */ > > + > > +#include <odp_posix_extensions.h> > > + > > #include <odp/api/cpu.h> > > +#include <odp_time_internal.h> > > +#include <odp_debug_internal.h> > > + > > +#include <time.h> > > > > uint64_t odp_cpu_cycles(void) > > { > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) > > { > > return 1; > > } > > + > > +uint64_t cpu_global_time(void) > > +{ > > + return odp_cpu_cycles(); > > A cycle counter cannot always be used to measure time. Even on x86, > odp_cpu_cycles() will return the value of RDTSC which is not actually > representative of the cycle count. Even if the x86 processor is set > to a fixed frequency, the Invariant TSC may run at a different fixed > frequency. Please take a look at the odp_tick_t proposal here: > > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW- > eL0oGLAQ4OM/edit?usp=sharing > > > +} > > + > > +#define SEC_IN_NS 1000000000ULL > > + > > +/* Measure TSC frequency. Frequency information registers are defined > for x86, > > + * but those are often not enumerated. */ > > +uint64_t cpu_global_time_freq(void) > > +{ > > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's > frequency in kHz. It would be better to use this when running under a > hypervisor. It appears to work under VMware as well. > > > + struct timespec sleep, ts1, ts2; > > + uint64_t t1, t2, ts_nsec, cycles, hz; > > + int i; > > + uint64_t avg = 0; > > + int rounds = 4; > > + > > + for (i = 0; i < rounds; i++) { > > + sleep.tv_sec = 0; > > + sleep.tv_nsec = SEC_IN_NS / 10; > > + > > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) { > > + ODP_DBG("clock_gettime failed\n"); > > + return 0; > > + } > > + > > + t1 = cpu_global_time(); > > + > > + if (nanosleep(&sleep, NULL) < 0) { > > + ODP_DBG("nanosleep failed\n"); > > + return 0; > > + } > > + > > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) { > > + ODP_DBG("clock_gettime failed\n"); > > + return 0; > > + } > > + > > + t2 = cpu_global_time(); > > + > > + ts_nsec = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS; > > + ts_nsec += ts2.tv_nsec - ts1.tv_nsec; > > + > > + cycles = t2 - t1; > > + > > + hz = (cycles * SEC_IN_NS) / ts_nsec; > > + avg += hz; > > + } > > + > > + return avg / rounds; > > +} > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h > b/platform/linux-generic/include/odp/api/plat/time_types.h > > index 4847f3b1..8ae7ce82 100644 > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > @@ -26,11 +26,28 @@ extern "C" { > > * the linux timespec structure, which is dependent on POSIX extension > level. > > */ > > typedef struct odp_time_t { > > - int64_t tv_sec; /**< @internal Seconds */ > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > + union { > > + /** @internal Posix timespec */ > > + struct { > > + /** @internal Seconds */ > > + int64_t tv_sec; > > + > > + /** @internal Nanoseconds */ > > + int64_t tv_nsec; > > + } spec; > > + > > + /** @internal HW time counter */ > > + struct { > > + /** @internal Counter value */ > > + uint64_t count; > > + > > + /** @internal Reserved */ > > + uint64_t res; > > + } hw; > > + }; > > A processor's tick counter cannot be used to measure calendar time by > itself. The delta between two ticks, however, can be converted to > calendar time. > > Please see the proposal that introduces odp_tick_t which eliminates > the wasted u64 reserved field in the union above. > > > > } odp_time_t; > > > > -#define ODP_TIME_NULL ((odp_time_t){0, 0}) > > +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} }) > > > > /** > > * @} > > diff --git a/platform/linux-generic/include/odp_time_internal.h > b/platform/linux-generic/include/odp_time_internal.h > > new file mode 100644 > > index 00000000..99ac7977 > > --- /dev/null > > +++ b/platform/linux-generic/include/odp_time_internal.h > > @@ -0,0 +1,24 @@ > > +/* Copyright (c) 2017, Linaro Limited > > + * All rights reserved. > > + * > > + * SPDX-License-Identifier: BSD-3-Clause > > + */ > > + > > +#ifndef ODP_TIME_INTERNAL_H_ > > +#define ODP_TIME_INTERNAL_H_ > > + > > +#ifdef __cplusplus > > +extern "C" { > > +#endif > > + > > +#include <stdint.h> > > + > > +int cpu_has_global_time(void); > > +uint64_t cpu_global_time(void); > > +uint64_t cpu_global_time_freq(void); > > + > > +#ifdef __cplusplus > > +} > > +#endif > > + > > +#endif > > diff --git a/platform/linux-generic/odp_time.c > b/platform/linux-generic/odp_time.c > > index 81e05224..2dd8f2c4 100644 > > --- a/platform/linux-generic/odp_time.c > > +++ b/platform/linux-generic/odp_time.c > > @@ -10,36 +10,39 @@ > > #include <odp/api/time.h> > > #include <odp/api/hints.h> > > #include <odp_debug_internal.h> > > +#include <odp_time_internal.h> > > +#include <string.h> > > +#include <inttypes.h> > > > > -static odp_time_t start_time; > > +typedef struct time_global_t { > > + odp_time_t start_time; > > + int use_hw; > > + uint64_t hw_start; > > + uint64_t hw_freq_hz; > > +} time_global_t; > > > > -static inline > > -uint64_t time_to_ns(odp_time_t time) > > -{ > > - uint64_t ns; > > - > > - ns = time.tv_sec * ODP_TIME_SEC_IN_NS; > > - ns += time.tv_nsec; > > +static time_global_t global; > > > > - return ns; > > -} > > +/* > > + * Posix timespec based functions > > + */ > > > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) > > Static functions in .c files will always be considered for inlining, so the > inline qualifier is unnecessary here. You can use always_inline compiler > attribute if the disassembly and run time performance is not as expected. > > Comment applies to nearly every function in this file. > > > { > > odp_time_t time; > > > > - time.tv_sec = t2.tv_sec - t1.tv_sec; > > - time.tv_nsec = t2.tv_nsec - t1.tv_nsec; > > + time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec; > > + time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec; > > > > - if (time.tv_nsec < 0) { > > - time.tv_nsec += ODP_TIME_SEC_IN_NS; > > - --time.tv_sec; > > + if (time.spec.tv_nsec < 0) { > > + time.spec.tv_nsec += ODP_TIME_SEC_IN_NS; > > + --time.spec.tv_sec; > > } > > > > return time; > > } > > > > -static inline odp_time_t time_local(void) > > +static inline odp_time_t time_spec_cur(void) > > { > > int ret; > > odp_time_t time; > > @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void) > > if (odp_unlikely(ret != 0)) > > ODP_ABORT("clock_gettime failed\n"); > > > > - time.tv_sec = sys_time.tv_sec; > > - time.tv_nsec = sys_time.tv_nsec; > > + time.spec.tv_sec = sys_time.tv_sec; > > + time.spec.tv_nsec = sys_time.tv_nsec; > > > > - return time_diff(time, start_time); > > + return time_spec_diff(time, global.start_time); > > } > > > > -static inline int time_cmp(odp_time_t t2, odp_time_t t1) > > +static inline uint64_t time_spec_res(void) > > { > > - if (t2.tv_sec < t1.tv_sec) > > + int ret; > > + struct timespec tres; > > + > > + ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); > > + if (odp_unlikely(ret != 0)) > > + ODP_ABORT("clock_getres failed\n"); > > + > > + return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; > > +} > > + > > +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1) > > +{ > > + if (t2.spec.tv_sec < t1.spec.tv_sec) > > return -1; > > > > - if (t2.tv_sec > t1.tv_sec) > > + if (t2.spec.tv_sec > t1.spec.tv_sec) > > return 1; > > > > - return t2.tv_nsec - t1.tv_nsec; > > + return t2.spec.tv_nsec - t1.spec.tv_nsec; > > } > > > > -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) > > +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2) > > { > > odp_time_t time; > > > > - time.tv_sec = t2.tv_sec + t1.tv_sec; > > - time.tv_nsec = t2.tv_nsec + t1.tv_nsec; > > + time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec; > > + time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec; > > > > - if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { > > - time.tv_nsec -= ODP_TIME_SEC_IN_NS; > > - ++time.tv_sec; > > + if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { > > + time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS; > > + ++time.spec.tv_sec; > > } > > > > return time; > > } > > > > -static inline odp_time_t time_local_from_ns(uint64_t ns) > > +static inline uint64_t time_spec_to_ns(odp_time_t time) > > +{ > > + uint64_t ns; > > + > > + ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS; > > + ns += time.spec.tv_nsec; > > + > > + return ns; > > +} > > + > > +static inline odp_time_t time_spec_from_ns(uint64_t ns) > > { > > odp_time_t time; > > > > - time.tv_sec = ns / ODP_TIME_SEC_IN_NS; > > - time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS; > > + time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS; > > + time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS; > > > > return time; > > } > > > > -static inline void time_wait_until(odp_time_t time) > > +/* > > + * HW time counter based functions > > + */ > > + > > +static inline odp_time_t time_hw_cur(void) > > { > > - odp_time_t cur; > > + odp_time_t time; > > > > - do { > > - cur = time_local(); > > - } while (time_cmp(time, cur) > 0); > > + time.hw.res = 0; > > + time.hw.count = cpu_global_time() - global.hw_start; > > + > > + return time; > > } > > > > -static inline uint64_t time_local_res(void) > > +static inline uint64_t time_hw_res(void) > > I think 'freq' is a more descriptive name than 'res'. > > It would also be good to document at either the function declaration > or the function definition (if multiple implementations differ) whether > the frequency is in Hz or kHz. For example, on ARMv8 it is just a > register read to get Hz. No estimation, and no lowering to kHz. > > > { > > - int ret; > > - struct timespec tres; > > + /* Promise a bit lower resolution than average cycle counter > > + * frequency */ > > + return global.hw_freq_hz / 10; > > +} > > > > - ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); > > - if (odp_unlikely(ret != 0)) > > - ODP_ABORT("clock_getres failed\n"); > > +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1) > > +{ > > + if (odp_likely(t2.hw.count > t1.hw.count)) > > + return 1; > > > > - return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; > > + if (t2.hw.count < t1.hw.count) > > + return -1; > > + > > + return 0; > > +} > > + > > +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1) > > +{ > > + odp_time_t time; > > + > > + time.hw.res = 0; > > + time.hw.count = t2.hw.count - t1.hw.count; > > + > > + return time; > > +} > > + > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > > +{ > > + odp_time_t time; > > + > > + time.hw.res = 0; > > + time.hw.count = t1.hw.count + t2.hw.count; > > + > > + return time; > > +} > > If a single u64 were used this code wouldn't need to exist. > > > +static inline uint64_t time_hw_to_ns(odp_time_t time) > > +{ > > + uint64_t nsec; > > + uint64_t freq_hz = global.hw_freq_hz; > > + uint64_t count = time.hw.count; > > + uint64_t sec = 0; > > + > > + if (count >= freq_hz) { > > + sec = count / freq_hz; > > + count = count - sec * freq_hz; > > + } > > + > > + nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz; > > + > > + return (sec * ODP_TIME_SEC_IN_NS) + nsec; > > +} > > + > > +static inline odp_time_t time_hw_from_ns(uint64_t ns) > > +{ > > + odp_time_t time; > > + uint64_t count; > > + uint64_t freq_hz = global.hw_freq_hz; > > + uint64_t sec = 0; > > + > > + if (ns >= ODP_TIME_SEC_IN_NS) { > > + sec = ns / ODP_TIME_SEC_IN_NS; > > + ns = ns - sec * ODP_TIME_SEC_IN_NS; > > + } > > + > > + count = sec * freq_hz; > > + count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS; > > + > > + time.hw.res = 0; > > + time.hw.count = count; > > + > > + return time; > > +} > > + > > +/* > > + * Common functions > > + */ > > + > > +static inline odp_time_t time_cur(void) > > +{ > > + if (global.use_hw) > > + return time_hw_cur(); > > + > > + return time_spec_cur(); > > +} > > + > > +static inline uint64_t time_res(void) > > +{ > > + if (global.use_hw) > > + return time_hw_res(); > > + > > + return time_spec_res(); > > +} > > + > > +static inline int time_cmp(odp_time_t t2, odp_time_t t1) > > +{ > > + if (global.use_hw) > > + return time_hw_cmp(t2, t1); > > + > > + return time_spec_cmp(t2, t1); > > +} > > + > > +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > > +{ > > + if (global.use_hw) > > + return time_hw_diff(t2, t1); > > + > > + return time_spec_diff(t2, t1); > > +} > > + > > +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) > > +{ > > + if (global.use_hw) > > + return time_hw_sum(t1, t2); > > + > > + return time_spec_sum(t1, t2); > > +} > > + > > +static inline uint64_t time_to_ns(odp_time_t time) > > +{ > > + if (global.use_hw) > > + return time_hw_to_ns(time); > > + > > + return time_spec_to_ns(time); > > +} > > + > > +static inline odp_time_t time_from_ns(uint64_t ns) > > +{ > > + if (global.use_hw) > > + return time_hw_from_ns(ns); > > + > > + return time_spec_from_ns(ns); > > +} > > + > > +static inline void time_wait_until(odp_time_t time) > > +{ > > + odp_time_t cur; > > + > > + do { > > + cur = time_cur(); > > + } while (time_cmp(time, cur) > 0); > > } > > > > odp_time_t odp_time_local(void) > > { > > - return time_local(); > > + return time_cur(); > > } > > > > odp_time_t odp_time_global(void) > > { > > - return time_local(); > > + return time_cur(); > > } > > > > odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1) > > @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time) > > > > odp_time_t odp_time_local_from_ns(uint64_t ns) > > { > > - return time_local_from_ns(ns); > > + return time_from_ns(ns); > > } > > > > odp_time_t odp_time_global_from_ns(uint64_t ns) > > { > > - return time_local_from_ns(ns); > > + return time_from_ns(ns); > > } > > > > int odp_time_cmp(odp_time_t t2, odp_time_t t1) > > @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t > t2) > > > > uint64_t odp_time_local_res(void) > > { > > - return time_local_res(); > > + return time_res(); > > } > > > > uint64_t odp_time_global_res(void) > > { > > - return time_local_res(); > > + return time_res(); > > } > > > > void odp_time_wait_ns(uint64_t ns) > > { > > - odp_time_t cur = time_local(); > > - odp_time_t wait = time_local_from_ns(ns); > > + odp_time_t cur = time_cur(); > > + odp_time_t wait = time_from_ns(ns); > > odp_time_t end_time = time_sum(cur, wait); > > > > time_wait_until(end_time); > > @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time) > > > > int odp_time_init_global(void) > > { > > - int ret; > > - struct timespec time; > > - > > - ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time); > > - if (ret) { > > - start_time = ODP_TIME_NULL; > > - } else { > > - start_time.tv_sec = time.tv_sec; > > - start_time.tv_nsec = time.tv_nsec; > > + struct timespec sys_time; > > + int ret = 0; > > + > > + memset(&global, 0, sizeof(time_global_t)); > > + > > + if (cpu_has_global_time()) { > > + global.use_hw = 1; > > + global.hw_freq_hz = cpu_global_time_freq(); > > + > > + if (global.hw_freq_hz == 0) > > + return -1; > > + > > + printf("HW time counter freq: %" PRIu64 " hz\n\n", > > + global.hw_freq_hz); > > + > > + global.hw_start = cpu_global_time(); > > + return 0; > > + } > > + > > + global.start_time = ODP_TIME_NULL; > > + > > + ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time); > > + if (ret == 0) { > > + global.start_time.spec.tv_sec = sys_time.tv_sec; > > + global.start_time.spec.tv_nsec = sys_time.tv_nsec; > > } > > > > return ret; > > -- > > 2.11.0 > > >
On 04/21 14:19:27, Bill Fischofer wrote: > Good stuff. I'll put this on the agenda for Monday's ARCH call to discuss. Hi Bill, Petri, I will only be able to join the Wednesday ARCH call this week. Thanks, Brian > On Fri, Apr 21, 2017 at 12:29 PM, Brian Brooks <brian.brooks@arm.com> wrote: > > > On 04/21 16:11:34, Petri Savolainen wrote: > > > Use 64 bit HW time counter when available. It is used on > > > x86 when invariant TSC CPU flag indicates that TSC frequency > > > is constant. Otherwise, the system time is used as before. Direct > > > HW time counter usage avoids system call, and related latency > > > and performance issues. > > > > > > Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org> > > > --- > > > platform/linux-generic/Makefile.am | 1 + > > > platform/linux-generic/arch/arm/odp_cpu_arch.c | 11 + > > > platform/linux-generic/arch/default/odp_cpu_arch.c | 11 + > > > platform/linux-generic/arch/mips64/odp_cpu_arch.c | 11 + > > > platform/linux-generic/arch/powerpc/odp_cpu_arch.c | 11 + > > > platform/linux-generic/arch/x86/cpu_flags.c | 9 + > > > platform/linux-generic/arch/x86/odp_cpu_arch.c | 59 ++++ > > > .../include/odp/api/plat/time_types.h | 23 +- > > > platform/linux-generic/include/odp_time_internal.h | 24 ++ > > > platform/linux-generic/odp_time.c | 303 > > ++++++++++++++++----- > > > 10 files changed, 398 insertions(+), 65 deletions(-) > > > create mode 100644 platform/linux-generic/include/odp_time_internal.h > > > > > > diff --git a/platform/linux-generic/Makefile.am > > b/platform/linux-generic/Makefile.am > > > index 60b7f849..ed66fecf 100644 > > > --- a/platform/linux-generic/Makefile.am > > > +++ b/platform/linux-generic/Makefile.am > > > @@ -171,6 +171,7 @@ noinst_HEADERS = \ > > > ${srcdir}/include/odp_schedule_if.h \ > > > ${srcdir}/include/odp_sorted_list_internal.h \ > > > ${srcdir}/include/odp_shm_internal.h \ > > > + ${srcdir}/include/odp_time_internal.h \ > > > ${srcdir}/include/odp_timer_internal.h \ > > > ${srcdir}/include/odp_timer_wheel_internal.h \ > > > ${srcdir}/include/odp_traffic_mngr_internal.h \ > > > diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c > > b/platform/linux-generic/arch/arm/odp_cpu_arch.c > > > index 2ac223e0..3a87f09c 100644 > > > --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c > > > +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c > > > @@ -13,6 +13,7 @@ > > > #include <odp/api/hints.h> > > > #include <odp/api/system_info.h> > > > #include <odp_debug_internal.h> > > > +#include <odp_time_internal.h> > > > > > > #define GIGA 1000000000 > > > > > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > > > { > > > return 1; > > > } > > > + > > > +uint64_t cpu_global_time(void) > > > +{ > > > + return 0; > > > +} > > > + > > > +uint64_t cpu_global_time_freq(void) > > > +{ > > > + return 0; > > > +} > > > diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c > > b/platform/linux-generic/arch/default/odp_cpu_arch.c > > > index 2ac223e0..3a87f09c 100644 > > > --- a/platform/linux-generic/arch/default/odp_cpu_arch.c > > > +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c > > > @@ -13,6 +13,7 @@ > > > #include <odp/api/hints.h> > > > #include <odp/api/system_info.h> > > > #include <odp_debug_internal.h> > > > +#include <odp_time_internal.h> > > > > > > #define GIGA 1000000000 > > > > > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > > > { > > > return 1; > > > } > > > + > > > +uint64_t cpu_global_time(void) > > > +{ > > > + return 0; > > > +} > > > + > > > +uint64_t cpu_global_time_freq(void) > > > +{ > > > + return 0; > > > +} > > > diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c > > b/platform/linux-generic/arch/mips64/odp_cpu_arch.c > > > index 646acf9c..a9a94531 100644 > > > --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c > > > +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c > > > @@ -7,6 +7,7 @@ > > > #include <odp/api/cpu.h> > > > #include <odp/api/hints.h> > > > #include <odp/api/system_info.h> > > > +#include <odp_time_internal.h> > > > > > > uint64_t odp_cpu_cycles(void) > > > { > > > @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void) > > > { > > > return 1; > > > } > > > + > > > +uint64_t cpu_global_time(void) > > > +{ > > > + return 0; > > > +} > > > + > > > +uint64_t cpu_global_time_freq(void) > > > +{ > > > + return 0; > > > +} > > > diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > > b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > > > index 2ac223e0..3a87f09c 100644 > > > --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > > > +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > > > @@ -13,6 +13,7 @@ > > > #include <odp/api/hints.h> > > > #include <odp/api/system_info.h> > > > #include <odp_debug_internal.h> > > > +#include <odp_time_internal.h> > > > > > > #define GIGA 1000000000 > > > > > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > > > { > > > return 1; > > > } > > > + > > > +uint64_t cpu_global_time(void) > > > +{ > > > + return 0; > > > +} > > > + > > > +uint64_t cpu_global_time_freq(void) > > > +{ > > > + return 0; > > > +} > > > diff --git a/platform/linux-generic/arch/x86/cpu_flags.c > > b/platform/linux-generic/arch/x86/cpu_flags.c > > > index 8fb9477a..cde8ad3e 100644 > > > --- a/platform/linux-generic/arch/x86/cpu_flags.c > > > +++ b/platform/linux-generic/arch/x86/cpu_flags.c > > > @@ -38,6 +38,7 @@ > > > */ > > > > > > #include <arch/x86/cpu_flags.h> > > > +#include <odp_time_internal.h> > > > #include <stdio.h> > > > #include <stdint.h> > > > > > > @@ -347,3 +348,11 @@ void cpu_flags_print_all(void) > > > > > > printf("\n\n"); > > > } > > > + > > > +int cpu_has_global_time(void) > > > +{ > > > + if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0) > > > + return 1; > > > + > > > + return 0; > > > +} > > > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c > > b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > index c8cf27b6..9ba601a3 100644 > > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > @@ -3,7 +3,14 @@ > > > * > > > * SPDX-License-Identifier: BSD-3-Clause > > > */ > > > + > > > +#include <odp_posix_extensions.h> > > > + > > > #include <odp/api/cpu.h> > > > +#include <odp_time_internal.h> > > > +#include <odp_debug_internal.h> > > > + > > > +#include <time.h> > > > > > > uint64_t odp_cpu_cycles(void) > > > { > > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) > > > { > > > return 1; > > > } > > > + > > > +uint64_t cpu_global_time(void) > > > +{ > > > + return odp_cpu_cycles(); > > > > A cycle counter cannot always be used to measure time. Even on x86, > > odp_cpu_cycles() will return the value of RDTSC which is not actually > > representative of the cycle count. Even if the x86 processor is set > > to a fixed frequency, the Invariant TSC may run at a different fixed > > frequency. Please take a look at the odp_tick_t proposal here: > > > > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW- > > eL0oGLAQ4OM/edit?usp=sharing > > > > > +} > > > + > > > +#define SEC_IN_NS 1000000000ULL > > > + > > > +/* Measure TSC frequency. Frequency information registers are defined > > for x86, > > > + * but those are often not enumerated. */ > > > +uint64_t cpu_global_time_freq(void) > > > +{ > > > > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's > > frequency in kHz. It would be better to use this when running under a > > hypervisor. It appears to work under VMware as well. > > > > > + struct timespec sleep, ts1, ts2; > > > + uint64_t t1, t2, ts_nsec, cycles, hz; > > > + int i; > > > + uint64_t avg = 0; > > > + int rounds = 4; > > > + > > > + for (i = 0; i < rounds; i++) { > > > + sleep.tv_sec = 0; > > > + sleep.tv_nsec = SEC_IN_NS / 10; > > > + > > > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) { > > > + ODP_DBG("clock_gettime failed\n"); > > > + return 0; > > > + } > > > + > > > + t1 = cpu_global_time(); > > > + > > > + if (nanosleep(&sleep, NULL) < 0) { > > > + ODP_DBG("nanosleep failed\n"); > > > + return 0; > > > + } > > > + > > > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) { > > > + ODP_DBG("clock_gettime failed\n"); > > > + return 0; > > > + } > > > + > > > + t2 = cpu_global_time(); > > > + > > > + ts_nsec = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS; > > > + ts_nsec += ts2.tv_nsec - ts1.tv_nsec; > > > + > > > + cycles = t2 - t1; > > > + > > > + hz = (cycles * SEC_IN_NS) / ts_nsec; > > > + avg += hz; > > > + } > > > + > > > + return avg / rounds; > > > +} > > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h > > b/platform/linux-generic/include/odp/api/plat/time_types.h > > > index 4847f3b1..8ae7ce82 100644 > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > > @@ -26,11 +26,28 @@ extern "C" { > > > * the linux timespec structure, which is dependent on POSIX extension > > level. > > > */ > > > typedef struct odp_time_t { > > > - int64_t tv_sec; /**< @internal Seconds */ > > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > > + union { > > > + /** @internal Posix timespec */ > > > + struct { > > > + /** @internal Seconds */ > > > + int64_t tv_sec; > > > + > > > + /** @internal Nanoseconds */ > > > + int64_t tv_nsec; > > > + } spec; > > > + > > > + /** @internal HW time counter */ > > > + struct { > > > + /** @internal Counter value */ > > > + uint64_t count; > > > + > > > + /** @internal Reserved */ > > > + uint64_t res; > > > + } hw; > > > + }; > > > > A processor's tick counter cannot be used to measure calendar time by > > itself. The delta between two ticks, however, can be converted to > > calendar time. > > > > Please see the proposal that introduces odp_tick_t which eliminates > > the wasted u64 reserved field in the union above. > > > > > > > } odp_time_t; > > > > > > -#define ODP_TIME_NULL ((odp_time_t){0, 0}) > > > +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} }) > > > > > > /** > > > * @} > > > diff --git a/platform/linux-generic/include/odp_time_internal.h > > b/platform/linux-generic/include/odp_time_internal.h > > > new file mode 100644 > > > index 00000000..99ac7977 > > > --- /dev/null > > > +++ b/platform/linux-generic/include/odp_time_internal.h > > > @@ -0,0 +1,24 @@ > > > +/* Copyright (c) 2017, Linaro Limited > > > + * All rights reserved. > > > + * > > > + * SPDX-License-Identifier: BSD-3-Clause > > > + */ > > > + > > > +#ifndef ODP_TIME_INTERNAL_H_ > > > +#define ODP_TIME_INTERNAL_H_ > > > + > > > +#ifdef __cplusplus > > > +extern "C" { > > > +#endif > > > + > > > +#include <stdint.h> > > > + > > > +int cpu_has_global_time(void); > > > +uint64_t cpu_global_time(void); > > > +uint64_t cpu_global_time_freq(void); > > > + > > > +#ifdef __cplusplus > > > +} > > > +#endif > > > + > > > +#endif > > > diff --git a/platform/linux-generic/odp_time.c > > b/platform/linux-generic/odp_time.c > > > index 81e05224..2dd8f2c4 100644 > > > --- a/platform/linux-generic/odp_time.c > > > +++ b/platform/linux-generic/odp_time.c > > > @@ -10,36 +10,39 @@ > > > #include <odp/api/time.h> > > > #include <odp/api/hints.h> > > > #include <odp_debug_internal.h> > > > +#include <odp_time_internal.h> > > > +#include <string.h> > > > +#include <inttypes.h> > > > > > > -static odp_time_t start_time; > > > +typedef struct time_global_t { > > > + odp_time_t start_time; > > > + int use_hw; > > > + uint64_t hw_start; > > > + uint64_t hw_freq_hz; > > > +} time_global_t; > > > > > > -static inline > > > -uint64_t time_to_ns(odp_time_t time) > > > -{ > > > - uint64_t ns; > > > - > > > - ns = time.tv_sec * ODP_TIME_SEC_IN_NS; > > > - ns += time.tv_nsec; > > > +static time_global_t global; > > > > > > - return ns; > > > -} > > > +/* > > > + * Posix timespec based functions > > > + */ > > > > > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) > > > > Static functions in .c files will always be considered for inlining, so the > > inline qualifier is unnecessary here. You can use always_inline compiler > > attribute if the disassembly and run time performance is not as expected. > > > > Comment applies to nearly every function in this file. > > > > > { > > > odp_time_t time; > > > > > > - time.tv_sec = t2.tv_sec - t1.tv_sec; > > > - time.tv_nsec = t2.tv_nsec - t1.tv_nsec; > > > + time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec; > > > + time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec; > > > > > > - if (time.tv_nsec < 0) { > > > - time.tv_nsec += ODP_TIME_SEC_IN_NS; > > > - --time.tv_sec; > > > + if (time.spec.tv_nsec < 0) { > > > + time.spec.tv_nsec += ODP_TIME_SEC_IN_NS; > > > + --time.spec.tv_sec; > > > } > > > > > > return time; > > > } > > > > > > -static inline odp_time_t time_local(void) > > > +static inline odp_time_t time_spec_cur(void) > > > { > > > int ret; > > > odp_time_t time; > > > @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void) > > > if (odp_unlikely(ret != 0)) > > > ODP_ABORT("clock_gettime failed\n"); > > > > > > - time.tv_sec = sys_time.tv_sec; > > > - time.tv_nsec = sys_time.tv_nsec; > > > + time.spec.tv_sec = sys_time.tv_sec; > > > + time.spec.tv_nsec = sys_time.tv_nsec; > > > > > > - return time_diff(time, start_time); > > > + return time_spec_diff(time, global.start_time); > > > } > > > > > > -static inline int time_cmp(odp_time_t t2, odp_time_t t1) > > > +static inline uint64_t time_spec_res(void) > > > { > > > - if (t2.tv_sec < t1.tv_sec) > > > + int ret; > > > + struct timespec tres; > > > + > > > + ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); > > > + if (odp_unlikely(ret != 0)) > > > + ODP_ABORT("clock_getres failed\n"); > > > + > > > + return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; > > > +} > > > + > > > +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1) > > > +{ > > > + if (t2.spec.tv_sec < t1.spec.tv_sec) > > > return -1; > > > > > > - if (t2.tv_sec > t1.tv_sec) > > > + if (t2.spec.tv_sec > t1.spec.tv_sec) > > > return 1; > > > > > > - return t2.tv_nsec - t1.tv_nsec; > > > + return t2.spec.tv_nsec - t1.spec.tv_nsec; > > > } > > > > > > -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) > > > +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2) > > > { > > > odp_time_t time; > > > > > > - time.tv_sec = t2.tv_sec + t1.tv_sec; > > > - time.tv_nsec = t2.tv_nsec + t1.tv_nsec; > > > + time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec; > > > + time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec; > > > > > > - if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { > > > - time.tv_nsec -= ODP_TIME_SEC_IN_NS; > > > - ++time.tv_sec; > > > + if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { > > > + time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS; > > > + ++time.spec.tv_sec; > > > } > > > > > > return time; > > > } > > > > > > -static inline odp_time_t time_local_from_ns(uint64_t ns) > > > +static inline uint64_t time_spec_to_ns(odp_time_t time) > > > +{ > > > + uint64_t ns; > > > + > > > + ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS; > > > + ns += time.spec.tv_nsec; > > > + > > > + return ns; > > > +} > > > + > > > +static inline odp_time_t time_spec_from_ns(uint64_t ns) > > > { > > > odp_time_t time; > > > > > > - time.tv_sec = ns / ODP_TIME_SEC_IN_NS; > > > - time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS; > > > + time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS; > > > + time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS; > > > > > > return time; > > > } > > > > > > -static inline void time_wait_until(odp_time_t time) > > > +/* > > > + * HW time counter based functions > > > + */ > > > + > > > +static inline odp_time_t time_hw_cur(void) > > > { > > > - odp_time_t cur; > > > + odp_time_t time; > > > > > > - do { > > > - cur = time_local(); > > > - } while (time_cmp(time, cur) > 0); > > > + time.hw.res = 0; > > > + time.hw.count = cpu_global_time() - global.hw_start; > > > + > > > + return time; > > > } > > > > > > -static inline uint64_t time_local_res(void) > > > +static inline uint64_t time_hw_res(void) > > > > I think 'freq' is a more descriptive name than 'res'. > > > > It would also be good to document at either the function declaration > > or the function definition (if multiple implementations differ) whether > > the frequency is in Hz or kHz. For example, on ARMv8 it is just a > > register read to get Hz. No estimation, and no lowering to kHz. > > > > > { > > > - int ret; > > > - struct timespec tres; > > > + /* Promise a bit lower resolution than average cycle counter > > > + * frequency */ > > > + return global.hw_freq_hz / 10; > > > +} > > > > > > - ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); > > > - if (odp_unlikely(ret != 0)) > > > - ODP_ABORT("clock_getres failed\n"); > > > +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1) > > > +{ > > > + if (odp_likely(t2.hw.count > t1.hw.count)) > > > + return 1; > > > > > > - return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; > > > + if (t2.hw.count < t1.hw.count) > > > + return -1; > > > + > > > + return 0; > > > +} > > > + > > > +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1) > > > +{ > > > + odp_time_t time; > > > + > > > + time.hw.res = 0; > > > + time.hw.count = t2.hw.count - t1.hw.count; > > > + > > > + return time; > > > +} > > > + > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > > > +{ > > > + odp_time_t time; > > > + > > > + time.hw.res = 0; > > > + time.hw.count = t1.hw.count + t2.hw.count; > > > + > > > + return time; > > > +} > > > > If a single u64 were used this code wouldn't need to exist. > > > > > +static inline uint64_t time_hw_to_ns(odp_time_t time) > > > +{ > > > + uint64_t nsec; > > > + uint64_t freq_hz = global.hw_freq_hz; > > > + uint64_t count = time.hw.count; > > > + uint64_t sec = 0; > > > + > > > + if (count >= freq_hz) { > > > + sec = count / freq_hz; > > > + count = count - sec * freq_hz; > > > + } > > > + > > > + nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz; > > > + > > > + return (sec * ODP_TIME_SEC_IN_NS) + nsec; > > > +} > > > + > > > +static inline odp_time_t time_hw_from_ns(uint64_t ns) > > > +{ > > > + odp_time_t time; > > > + uint64_t count; > > > + uint64_t freq_hz = global.hw_freq_hz; > > > + uint64_t sec = 0; > > > + > > > + if (ns >= ODP_TIME_SEC_IN_NS) { > > > + sec = ns / ODP_TIME_SEC_IN_NS; > > > + ns = ns - sec * ODP_TIME_SEC_IN_NS; > > > + } > > > + > > > + count = sec * freq_hz; > > > + count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS; > > > + > > > + time.hw.res = 0; > > > + time.hw.count = count; > > > + > > > + return time; > > > +} > > > + > > > +/* > > > + * Common functions > > > + */ > > > + > > > +static inline odp_time_t time_cur(void) > > > +{ > > > + if (global.use_hw) > > > + return time_hw_cur(); > > > + > > > + return time_spec_cur(); > > > +} > > > + > > > +static inline uint64_t time_res(void) > > > +{ > > > + if (global.use_hw) > > > + return time_hw_res(); > > > + > > > + return time_spec_res(); > > > +} > > > + > > > +static inline int time_cmp(odp_time_t t2, odp_time_t t1) > > > +{ > > > + if (global.use_hw) > > > + return time_hw_cmp(t2, t1); > > > + > > > + return time_spec_cmp(t2, t1); > > > +} > > > + > > > +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > > > +{ > > > + if (global.use_hw) > > > + return time_hw_diff(t2, t1); > > > + > > > + return time_spec_diff(t2, t1); > > > +} > > > + > > > +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) > > > +{ > > > + if (global.use_hw) > > > + return time_hw_sum(t1, t2); > > > + > > > + return time_spec_sum(t1, t2); > > > +} > > > + > > > +static inline uint64_t time_to_ns(odp_time_t time) > > > +{ > > > + if (global.use_hw) > > > + return time_hw_to_ns(time); > > > + > > > + return time_spec_to_ns(time); > > > +} > > > + > > > +static inline odp_time_t time_from_ns(uint64_t ns) > > > +{ > > > + if (global.use_hw) > > > + return time_hw_from_ns(ns); > > > + > > > + return time_spec_from_ns(ns); > > > +} > > > + > > > +static inline void time_wait_until(odp_time_t time) > > > +{ > > > + odp_time_t cur; > > > + > > > + do { > > > + cur = time_cur(); > > > + } while (time_cmp(time, cur) > 0); > > > } > > > > > > odp_time_t odp_time_local(void) > > > { > > > - return time_local(); > > > + return time_cur(); > > > } > > > > > > odp_time_t odp_time_global(void) > > > { > > > - return time_local(); > > > + return time_cur(); > > > } > > > > > > odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1) > > > @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time) > > > > > > odp_time_t odp_time_local_from_ns(uint64_t ns) > > > { > > > - return time_local_from_ns(ns); > > > + return time_from_ns(ns); > > > } > > > > > > odp_time_t odp_time_global_from_ns(uint64_t ns) > > > { > > > - return time_local_from_ns(ns); > > > + return time_from_ns(ns); > > > } > > > > > > int odp_time_cmp(odp_time_t t2, odp_time_t t1) > > > @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t > > t2) > > > > > > uint64_t odp_time_local_res(void) > > > { > > > - return time_local_res(); > > > + return time_res(); > > > } > > > > > > uint64_t odp_time_global_res(void) > > > { > > > - return time_local_res(); > > > + return time_res(); > > > } > > > > > > void odp_time_wait_ns(uint64_t ns) > > > { > > > - odp_time_t cur = time_local(); > > > - odp_time_t wait = time_local_from_ns(ns); > > > + odp_time_t cur = time_cur(); > > > + odp_time_t wait = time_from_ns(ns); > > > odp_time_t end_time = time_sum(cur, wait); > > > > > > time_wait_until(end_time); > > > @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time) > > > > > > int odp_time_init_global(void) > > > { > > > - int ret; > > > - struct timespec time; > > > - > > > - ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time); > > > - if (ret) { > > > - start_time = ODP_TIME_NULL; > > > - } else { > > > - start_time.tv_sec = time.tv_sec; > > > - start_time.tv_nsec = time.tv_nsec; > > > + struct timespec sys_time; > > > + int ret = 0; > > > + > > > + memset(&global, 0, sizeof(time_global_t)); > > > + > > > + if (cpu_has_global_time()) { > > > + global.use_hw = 1; > > > + global.hw_freq_hz = cpu_global_time_freq(); > > > + > > > + if (global.hw_freq_hz == 0) > > > + return -1; > > > + > > > + printf("HW time counter freq: %" PRIu64 " hz\n\n", > > > + global.hw_freq_hz); > > > + > > > + global.hw_start = cpu_global_time(); > > > + return 0; > > > + } > > > + > > > + global.start_time = ODP_TIME_NULL; > > > + > > > + ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time); > > > + if (ret == 0) { > > > + global.start_time.spec.tv_sec = sys_time.tv_sec; > > > + global.start_time.spec.tv_nsec = sys_time.tv_nsec; > > > } > > > > > > return ret; > > > -- > > > 2.11.0 > > > > >
> > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c > b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > index c8cf27b6..9ba601a3 100644 > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > @@ -3,7 +3,14 @@ > > * > > * SPDX-License-Identifier: BSD-3-Clause > > */ > > + > > +#include <odp_posix_extensions.h> > > + > > #include <odp/api/cpu.h> > > +#include <odp_time_internal.h> > > +#include <odp_debug_internal.h> > > + > > +#include <time.h> > > > > uint64_t odp_cpu_cycles(void) > > { > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) > > { > > return 1; > > } > > + > > +uint64_t cpu_global_time(void) > > +{ > > + return odp_cpu_cycles(); > > A cycle counter cannot always be used to measure time. Even on x86, > odp_cpu_cycles() will return the value of RDTSC which is not actually > representative of the cycle count. Even if the x86 processor is set > to a fixed frequency, the Invariant TSC may run at a different fixed > frequency. Please take a look at the odp_tick_t proposal here: > > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW- > eL0oGLAQ4OM/edit?usp=sharing > From coverletter: "This patch set modifies time implementation to use TSC when running on a x86 CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time is used as before. TSC is much more efficient both in performance and latency/jitter wise than Linux system call. This can be seen also with scheduler latency test which time stamps events with this API. All latency measurements (min, ave, max) improved significantly." This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC. > > +} > > + > > +#define SEC_IN_NS 1000000000ULL > > + > > +/* Measure TSC frequency. Frequency information registers are defined > for x86, > > + * but those are often not enumerated. */ > > +uint64_t cpu_global_time_freq(void) > > +{ > > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's > frequency in kHz. It would be better to use this when running under a > hypervisor. It appears to work under VMware as well. Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any. Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it. > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h > b/platform/linux-generic/include/odp/api/plat/time_types.h > > index 4847f3b1..8ae7ce82 100644 > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > @@ -26,11 +26,28 @@ extern "C" { > > * the linux timespec structure, which is dependent on POSIX extension > level. > > */ > > typedef struct odp_time_t { > > - int64_t tv_sec; /**< @internal Seconds */ > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > + union { > > + /** @internal Posix timespec */ > > + struct { > > + /** @internal Seconds */ > > + int64_t tv_sec; > > + > > + /** @internal Nanoseconds */ > > + int64_t tv_nsec; > > + } spec; > > + > > + /** @internal HW time counter */ > > + struct { > > + /** @internal Counter value */ > > + uint64_t count; > > + > > + /** @internal Reserved */ > > + uint64_t res; > > + } hw; > > + }; > > A processor's tick counter cannot be used to measure calendar time by > itself. The delta between two ticks, however, can be converted to > calendar time. > > Please see the proposal that introduces odp_tick_t which eliminates > the wasted u64 reserved field in the union above. Nothing is wasted here today. We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems. Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ? > > +/* > > + * Posix timespec based functions > > + */ > > > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) > > Static functions in .c files will always be considered for inlining, so > the > inline qualifier is unnecessary here. You can use always_inline compiler > attribute if the disassembly and run time performance is not as expected. > > Comment applies to nearly every function in this file. This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path. > > > > -static inline uint64_t time_local_res(void) > > +static inline uint64_t time_hw_res(void) > > I think 'freq' is a more descriptive name than 'res'. > > It would also be good to document at either the function declaration > or the function definition (if multiple implementations differ) whether > the frequency is in Hz or kHz. For example, on ARMv8 it is just a > register read to get Hz. No estimation, and no lowering to kHz. This implements resolution API functions ... /** * Global time resolution in hertz * * @return Global time resolution in hertz */ uint64_t odp_time_global_res(void); ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative. > > + > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > > +{ > > + odp_time_t time; > > + > > + time.hw.res = 0; > > + time.hw.count = t1.hw.count + t2.hw.count; > > + > > + return time; > > +} > > If a single u64 were used this code wouldn't need to exist. Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2. -Petri
On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c > > b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > index c8cf27b6..9ba601a3 100644 > > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > @@ -3,7 +3,14 @@ > > > * > > > * SPDX-License-Identifier: BSD-3-Clause > > > */ > > > + > > > +#include <odp_posix_extensions.h> > > > + > > > #include <odp/api/cpu.h> > > > +#include <odp_time_internal.h> > > > +#include <odp_debug_internal.h> > > > + > > > +#include <time.h> > > > > > > uint64_t odp_cpu_cycles(void) > > > { > > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) > > > { > > > return 1; > > > } > > > + > > > +uint64_t cpu_global_time(void) > > > +{ > > > + return odp_cpu_cycles(); > > > > A cycle counter cannot always be used to measure time. Even on x86, > > odp_cpu_cycles() will return the value of RDTSC which is not actually > > representative of the cycle count. Even if the x86 processor is set > > to a fixed frequency, the Invariant TSC may run at a different fixed > > frequency. Please take a look at the odp_tick_t proposal here: > > > > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW- > > eL0oGLAQ4OM/edit?usp=sharing > > > > From coverletter: > "This patch set modifies time implementation to use TSC when running on a x86 > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time > is used as before. TSC is much more efficient both in performance and > latency/jitter wise than Linux system call. This can be seen also with > scheduler latency test which time stamps events with this API. All latency > measurements (min, ave, max) improved significantly." > > This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC. > > > > > > +} > > > + > > > +#define SEC_IN_NS 1000000000ULL > > > + > > > +/* Measure TSC frequency. Frequency information registers are defined > > for x86, > > > + * but those are often not enumerated. */ > > > +uint64_t cpu_global_time_freq(void) > > > +{ > > > > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's > > frequency in kHz. It would be better to use this when running under a > > hypervisor. It appears to work under VMware as well. > > > Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any. > > Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it. > > > > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h > > b/platform/linux-generic/include/odp/api/plat/time_types.h > > > index 4847f3b1..8ae7ce82 100644 > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > > @@ -26,11 +26,28 @@ extern "C" { > > > * the linux timespec structure, which is dependent on POSIX extension > > level. > > > */ > > > typedef struct odp_time_t { > > > - int64_t tv_sec; /**< @internal Seconds */ > > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > > + union { > > > + /** @internal Posix timespec */ > > > + struct { > > > + /** @internal Seconds */ > > > + int64_t tv_sec; > > > + > > > + /** @internal Nanoseconds */ > > > + int64_t tv_nsec; > > > + } spec; > > > + > > > + /** @internal HW time counter */ > > > + struct { > > > + /** @internal Counter value */ > > > + uint64_t count; > > > + > > > + /** @internal Reserved */ > > > + uint64_t res; > > > + } hw; > > > + }; > > > > A processor's tick counter cannot be used to measure calendar time by > > itself. The delta between two ticks, however, can be converted to > > calendar time. > > > > Please see the proposal that introduces odp_tick_t which eliminates > > the wasted u64 reserved field in the union above. > > > Nothing is wasted here today. The 64-bit CPU time is stored in a 128-bit data structure where 64-bits are wasted. You can use just a 64-bit type and then convert that to a timespec (using a starting timestamp taken on global init) if needed. > We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems. > > Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ? > > > > > > > +/* > > > + * Posix timespec based functions > > > + */ > > > > > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) > > > > Static functions in .c files will always be considered for inlining, so > > the > > inline qualifier is unnecessary here. You can use always_inline compiler > > attribute if the disassembly and run time performance is not as expected. > > > > Comment applies to nearly every function in this file. > > > This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path. I am saying that adding the 'inline' qualifier here doesn't actually do anything because the compiler will consider static functions for inlining anyways. > > > > > > > -static inline uint64_t time_local_res(void) > > > +static inline uint64_t time_hw_res(void) > > > > I think 'freq' is a more descriptive name than 'res'. > > > > It would also be good to document at either the function declaration > > or the function definition (if multiple implementations differ) whether > > the frequency is in Hz or kHz. For example, on ARMv8 it is just a > > register read to get Hz. No estimation, and no lowering to kHz. > > > This implements resolution API functions ... > > /** > * Global time resolution in hertz > * > * @return Global time resolution in hertz > */ > uint64_t odp_time_global_res(void); > > > ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative. > > > > > + > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > > > +{ > > > + odp_time_t time; > > > + > > > + time.hw.res = 0; > > > + time.hw.count = t1.hw.count + t2.hw.count; > > > + > > > + return time; > > > +} > > > > If a single u64 were used this code wouldn't need to exist. > > Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2. If the CPU time was stored in a 64-bit type, you can use arithmetic operators on the values directly instead of doing arithmetic on a 128-bit compound datatype where 64-bits are unused. This is the union above. > -Petri >
On 04/25 12:25:03, Brian Brooks wrote: > On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > > index c8cf27b6..9ba601a3 100644 > > > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c > > > > @@ -3,7 +3,14 @@ > > > > * > > > > * SPDX-License-Identifier: BSD-3-Clause > > > > */ > > > > + > > > > +#include <odp_posix_extensions.h> > > > > + > > > > #include <odp/api/cpu.h> > > > > +#include <odp_time_internal.h> > > > > +#include <odp_debug_internal.h> > > > > + > > > > +#include <time.h> > > > > > > > > uint64_t odp_cpu_cycles(void) > > > > { > > > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) > > > > { > > > > return 1; > > > > } > > > > + > > > > +uint64_t cpu_global_time(void) > > > > +{ > > > > + return odp_cpu_cycles(); > > > > > > A cycle counter cannot always be used to measure time. Even on x86, > > > odp_cpu_cycles() will return the value of RDTSC which is not actually > > > representative of the cycle count. Even if the x86 processor is set > > > to a fixed frequency, the Invariant TSC may run at a different fixed > > > frequency. Please take a look at the odp_tick_t proposal here: > > > > > > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW- > > > eL0oGLAQ4OM/edit?usp=sharing > > > > > > > From coverletter: > > "This patch set modifies time implementation to use TSC when running on a x86 > > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time > > is used as before. TSC is much more efficient both in performance and > > latency/jitter wise than Linux system call. This can be seen also with > > scheduler latency test which time stamps events with this API. All latency > > measurements (min, ave, max) improved significantly." > > > > This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC. What sort of timing accuracy is expected from the app? From benchmarking the maximum single-threaded rate of these reads: x86_64: read 7 ns/op read_sync 22 ns/op A57: read 4 ns/op read_sync 26 ns/op read_sync issues a synchronizing instruction for greater timing accuracy but clearly takes more time to return the time value read from the core. > > > > +} > > > > + > > > > +#define SEC_IN_NS 1000000000ULL > > > > + > > > > +/* Measure TSC frequency. Frequency information registers are defined > > > for x86, > > > > + * but those are often not enumerated. */ > > > > +uint64_t cpu_global_time_freq(void) > > > > +{ > > > > > > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's > > > frequency in kHz. It would be better to use this when running under a > > > hypervisor. It appears to work under VMware as well. > > > > > > Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any. > > > > Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it. > > > > > > > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h > > > b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > index 4847f3b1..8ae7ce82 100644 > > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > @@ -26,11 +26,28 @@ extern "C" { > > > > * the linux timespec structure, which is dependent on POSIX extension > > > level. > > > > */ > > > > typedef struct odp_time_t { > > > > - int64_t tv_sec; /**< @internal Seconds */ > > > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > > > + union { > > > > + /** @internal Posix timespec */ > > > > + struct { > > > > + /** @internal Seconds */ > > > > + int64_t tv_sec; > > > > + > > > > + /** @internal Nanoseconds */ > > > > + int64_t tv_nsec; > > > > + } spec; > > > > + > > > > + /** @internal HW time counter */ > > > > + struct { > > > > + /** @internal Counter value */ > > > > + uint64_t count; > > > > + > > > > + /** @internal Reserved */ > > > > + uint64_t res; > > > > + } hw; > > > > + }; > > > > > > A processor's tick counter cannot be used to measure calendar time by > > > itself. The delta between two ticks, however, can be converted to > > > calendar time. > > > > > > Please see the proposal that introduces odp_tick_t which eliminates > > > the wasted u64 reserved field in the union above. > > > > > > Nothing is wasted here today. > > The 64-bit CPU time is stored in a 128-bit data structure where 64-bits > are wasted. You can use just a 64-bit type and then convert that to > a timespec (using a starting timestamp taken on global init) if needed. > > > We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems. > > > > Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ? > > > > > > > > > > > > +/* > > > > + * Posix timespec based functions > > > > + */ > > > > > > > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > > > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) > > > > > > Static functions in .c files will always be considered for inlining, so > > > the > > > inline qualifier is unnecessary here. You can use always_inline compiler > > > attribute if the disassembly and run time performance is not as expected. > > > > > > Comment applies to nearly every function in this file. > > > > > > This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path. > > I am saying that adding the 'inline' qualifier here doesn't actually do anything > because the compiler will consider static functions for inlining anyways. > > > > > > > > > > > -static inline uint64_t time_local_res(void) > > > > +static inline uint64_t time_hw_res(void) > > > > > > I think 'freq' is a more descriptive name than 'res'. > > > > > > It would also be good to document at either the function declaration > > > or the function definition (if multiple implementations differ) whether > > > the frequency is in Hz or kHz. For example, on ARMv8 it is just a > > > register read to get Hz. No estimation, and no lowering to kHz. > > > > > > This implements resolution API functions ... > > > > /** > > * Global time resolution in hertz > > * > > * @return Global time resolution in hertz > > */ > > uint64_t odp_time_global_res(void); > > > > > > ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative. > > > > > > > > + > > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > > > > +{ > > > > + odp_time_t time; > > > > + > > > > + time.hw.res = 0; > > > > + time.hw.count = t1.hw.count + t2.hw.count; > > > > + > > > > + return time; > > > > +} > > > > > > If a single u64 were used this code wouldn't need to exist. > > > > Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2. > > If the CPU time was stored in a 64-bit type, you can use arithmetic operators > on the values directly instead of doing arithmetic on a 128-bit compound > datatype where 64-bits are unused. This is the union above. > > > -Petri > >
On 04/25/17 21:51, Brian Brooks wrote: > On 04/25 12:25:03, Brian Brooks wrote: >> On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote: >>>>> diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c >>>> b/platform/linux-generic/arch/x86/odp_cpu_arch.c >>>>> index c8cf27b6..9ba601a3 100644 >>>>> --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c >>>>> +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c >>>>> @@ -3,7 +3,14 @@ >>>>> * >>>>> * SPDX-License-Identifier: BSD-3-Clause >>>>> */ >>>>> + >>>>> +#include <odp_posix_extensions.h> >>>>> + >>>>> #include <odp/api/cpu.h> >>>>> +#include <odp_time_internal.h> >>>>> +#include <odp_debug_internal.h> >>>>> + >>>>> +#include <time.h> >>>>> >>>>> uint64_t odp_cpu_cycles(void) >>>>> { >>>>> @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) >>>>> { >>>>> return 1; >>>>> } >>>>> + >>>>> +uint64_t cpu_global_time(void) >>>>> +{ >>>>> + return odp_cpu_cycles(); >>>> >>>> A cycle counter cannot always be used to measure time. Even on x86, >>>> odp_cpu_cycles() will return the value of RDTSC which is not actually >>>> representative of the cycle count. Even if the x86 processor is set >>>> to a fixed frequency, the Invariant TSC may run at a different fixed >>>> frequency. Please take a look at the odp_tick_t proposal here: >>>> >>>> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW- >>>> eL0oGLAQ4OM/edit?usp=sharing >>>> >>> >>> From coverletter: >>> "This patch set modifies time implementation to use TSC when running on a x86 >>> CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time >>> is used as before. TSC is much more efficient both in performance and >>> latency/jitter wise than Linux system call. This can be seen also with >>> scheduler latency test which time stamps events with this API. All latency >>> measurements (min, ave, max) improved significantly." >>> >>> This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC. > > What sort of timing accuracy is expected from the app? > > From benchmarking the maximum single-threaded rate of these reads: > > x86_64: > > read 7 ns/op > read_sync 22 ns/op > > A57: > > read 4 ns/op > read_sync 26 ns/op > > read_sync issues a synchronizing instruction for greater timing accuracy > but clearly takes more time to return the time value read from the core. > it has to be depend on cpu frequency. Maxim. >>>>> +} >>>>> + >>>>> +#define SEC_IN_NS 1000000000ULL >>>>> + >>>>> +/* Measure TSC frequency. Frequency information registers are defined >>>> for x86, >>>>> + * but those are often not enumerated. */ >>>>> +uint64_t cpu_global_time_freq(void) >>>>> +{ >>>> >>>> The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's >>>> frequency in kHz. It would be better to use this when running under a >>>> hypervisor. It appears to work under VMware as well. >>> >>> >>> Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any. >>> >>> Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it. >>> >>> >>>>> diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h >>>> b/platform/linux-generic/include/odp/api/plat/time_types.h >>>>> index 4847f3b1..8ae7ce82 100644 >>>>> --- a/platform/linux-generic/include/odp/api/plat/time_types.h >>>>> +++ b/platform/linux-generic/include/odp/api/plat/time_types.h >>>>> @@ -26,11 +26,28 @@ extern "C" { >>>>> * the linux timespec structure, which is dependent on POSIX extension >>>> level. >>>>> */ >>>>> typedef struct odp_time_t { >>>>> - int64_t tv_sec; /**< @internal Seconds */ >>>>> - int64_t tv_nsec; /**< @internal Nanoseconds */ >>>>> + union { >>>>> + /** @internal Posix timespec */ >>>>> + struct { >>>>> + /** @internal Seconds */ >>>>> + int64_t tv_sec; >>>>> + >>>>> + /** @internal Nanoseconds */ >>>>> + int64_t tv_nsec; >>>>> + } spec; >>>>> + >>>>> + /** @internal HW time counter */ >>>>> + struct { >>>>> + /** @internal Counter value */ >>>>> + uint64_t count; >>>>> + >>>>> + /** @internal Reserved */ >>>>> + uint64_t res; >>>>> + } hw; >>>>> + }; >>>> >>>> A processor's tick counter cannot be used to measure calendar time by >>>> itself. The delta between two ticks, however, can be converted to >>>> calendar time. >>>> >>>> Please see the proposal that introduces odp_tick_t which eliminates >>>> the wasted u64 reserved field in the union above. >>> >>> >>> Nothing is wasted here today. >> >> The 64-bit CPU time is stored in a 128-bit data structure where 64-bits >> are wasted. You can use just a 64-bit type and then convert that to >> a timespec (using a starting timestamp taken on global init) if needed. >> >>> We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems. >>> >>> Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ? >>> >>> >>> >>> >>>>> +/* >>>>> + * Posix timespec based functions >>>>> + */ >>>>> >>>>> -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) >>>>> +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) >>>> >>>> Static functions in .c files will always be considered for inlining, so >>>> the >>>> inline qualifier is unnecessary here. You can use always_inline compiler >>>> attribute if the disassembly and run time performance is not as expected. >>>> >>>> Comment applies to nearly every function in this file. >>> >>> >>> This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path. >> >> I am saying that adding the 'inline' qualifier here doesn't actually do anything >> because the compiler will consider static functions for inlining anyways. >> >>> >>>>> >>>>> -static inline uint64_t time_local_res(void) >>>>> +static inline uint64_t time_hw_res(void) >>>> >>>> I think 'freq' is a more descriptive name than 'res'. >>>> >>>> It would also be good to document at either the function declaration >>>> or the function definition (if multiple implementations differ) whether >>>> the frequency is in Hz or kHz. For example, on ARMv8 it is just a >>>> register read to get Hz. No estimation, and no lowering to kHz. >>> >>> >>> This implements resolution API functions ... >>> >>> /** >>> * Global time resolution in hertz >>> * >>> * @return Global time resolution in hertz >>> */ >>> uint64_t odp_time_global_res(void); >>> >>> >>> ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative. >>> >>> >>>>> + >>>>> +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) >>>>> +{ >>>>> + odp_time_t time; >>>>> + >>>>> + time.hw.res = 0; >>>>> + time.hw.count = t1.hw.count + t2.hw.count; >>>>> + >>>>> + return time; >>>>> +} >>>> >>>> If a single u64 were used this code wouldn't need to exist. >>> >>> Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2. >> >> If the CPU time was stored in a 64-bit type, you can use arithmetic operators >> on the values directly instead of doing arithmetic on a 128-bit compound >> datatype where 64-bits are unused. This is the union above. >> >>> -Petri >>>
On 04/25 22:47:11, Maxim Uvarov wrote: > On 04/25/17 21:51, Brian Brooks wrote: > > On 04/25 12:25:03, Brian Brooks wrote: > >> On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote: > >>>>> diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c > >>>> b/platform/linux-generic/arch/x86/odp_cpu_arch.c > >>>>> index c8cf27b6..9ba601a3 100644 > >>>>> --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c > >>>>> +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c > >>>>> @@ -3,7 +3,14 @@ > >>>>> * > >>>>> * SPDX-License-Identifier: BSD-3-Clause > >>>>> */ > >>>>> + > >>>>> +#include <odp_posix_extensions.h> > >>>>> + > >>>>> #include <odp/api/cpu.h> > >>>>> +#include <odp_time_internal.h> > >>>>> +#include <odp_debug_internal.h> > >>>>> + > >>>>> +#include <time.h> > >>>>> > >>>>> uint64_t odp_cpu_cycles(void) > >>>>> { > >>>>> @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) > >>>>> { > >>>>> return 1; > >>>>> } > >>>>> + > >>>>> +uint64_t cpu_global_time(void) > >>>>> +{ > >>>>> + return odp_cpu_cycles(); > >>>> > >>>> A cycle counter cannot always be used to measure time. Even on x86, > >>>> odp_cpu_cycles() will return the value of RDTSC which is not actually > >>>> representative of the cycle count. Even if the x86 processor is set > >>>> to a fixed frequency, the Invariant TSC may run at a different fixed > >>>> frequency. Please take a look at the odp_tick_t proposal here: > >>>> > >>>> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW- > >>>> eL0oGLAQ4OM/edit?usp=sharing > >>>> > >>> > >>> From coverletter: > >>> "This patch set modifies time implementation to use TSC when running on a x86 > >>> CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time > >>> is used as before. TSC is much more efficient both in performance and > >>> latency/jitter wise than Linux system call. This can be seen also with > >>> scheduler latency test which time stamps events with this API. All latency > >>> measurements (min, ave, max) improved significantly." > >>> > >>> This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC. > > > > What sort of timing accuracy is expected from the app? > > > > From benchmarking the maximum single-threaded rate of these reads: > > > > x86_64: > > > > read 7 ns/op > > read_sync 22 ns/op > > > > A57: > > > > read 4 ns/op > > read_sync 26 ns/op > > > > read_sync issues a synchronizing instruction for greater timing accuracy > > but clearly takes more time to return the time value read from the core. > > > > > it has to be depend on cpu frequency. > > Maxim. We are showing the difference between 'read' and 'read_sync' on the same machine here.
On 24 April 2017 at 03:07, Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com> wrote: >> > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c >> b/platform/linux-generic/arch/x86/odp_cpu_arch.c >> > index c8cf27b6..9ba601a3 100644 >> > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c >> > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c >> > @@ -3,7 +3,14 @@ >> > * >> > * SPDX-License-Identifier: BSD-3-Clause >> > */ >> > + >> > +#include <odp_posix_extensions.h> >> > + >> > #include <odp/api/cpu.h> >> > +#include <odp_time_internal.h> >> > +#include <odp_debug_internal.h> >> > + >> > +#include <time.h> >> > >> > uint64_t odp_cpu_cycles(void) >> > { >> > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) >> > { >> > return 1; >> > } >> > + >> > +uint64_t cpu_global_time(void) >> > +{ >> > + return odp_cpu_cycles(); >> >> A cycle counter cannot always be used to measure time. Even on x86, >> odp_cpu_cycles() will return the value of RDTSC which is not actually >> representative of the cycle count. Even if the x86 processor is set >> to a fixed frequency, the Invariant TSC may run at a different fixed >> frequency. Please take a look at the odp_tick_t proposal here: >> >> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW- >> eL0oGLAQ4OM/edit?usp=sharing >> > > From coverletter: > "This patch set modifies time implementation to use TSC when running on a x86 > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system time > is used as before. TSC is much more efficient both in performance and > latency/jitter wise than Linux system call. This can be seen also with > scheduler latency test which time stamps events with this API. All latency > measurements (min, ave, max) improved significantly." odp_sched_latency currently uses clock_gettime. It is my understanding that clock_gettime does not have the over head of the system call. Can you elaborate more on the 'improved significantly' part? > > This function (cpu_global_time()) is called only when we have first checked that TSC is invariant. Also we measure the TSC frequency in that case. This function is defined in the same file as cpu_cycles(), and the file is x86 specific. So, we know what we are doing, and just re-using the code to read TSC. > > > >> > +} >> > + >> > +#define SEC_IN_NS 1000000000ULL >> > + >> > +/* Measure TSC frequency. Frequency information registers are defined >> for x86, >> > + * but those are often not enumerated. */ >> > +uint64_t cpu_global_time_freq(void) >> > +{ >> >> The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's >> frequency in kHz. It would be better to use this when running under a >> hypervisor. It appears to work under VMware as well. > > > Sure, that can be done later, if we want to add any code into odp-linux that would behave differently under VM vs. host. Today, we don't have any. > > Also there's leaf 0x15, which should give you TSC frequency but apparently is not enumerated often. So, I ended up to do the same as DPDK - just measure it. > > >> > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h >> b/platform/linux-generic/include/odp/api/plat/time_types.h >> > index 4847f3b1..8ae7ce82 100644 >> > --- a/platform/linux-generic/include/odp/api/plat/time_types.h >> > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h >> > @@ -26,11 +26,28 @@ extern "C" { >> > * the linux timespec structure, which is dependent on POSIX extension >> level. >> > */ >> > typedef struct odp_time_t { >> > - int64_t tv_sec; /**< @internal Seconds */ >> > - int64_t tv_nsec; /**< @internal Nanoseconds */ >> > + union { >> > + /** @internal Posix timespec */ >> > + struct { >> > + /** @internal Seconds */ >> > + int64_t tv_sec; >> > + >> > + /** @internal Nanoseconds */ >> > + int64_t tv_nsec; >> > + } spec; >> > + >> > + /** @internal HW time counter */ >> > + struct { >> > + /** @internal Counter value */ >> > + uint64_t count; >> > + >> > + /** @internal Reserved */ >> > + uint64_t res; >> > + } hw; >> > + }; >> >> A processor's tick counter cannot be used to measure calendar time by >> itself. The delta between two ticks, however, can be converted to >> calendar time. >> >> Please see the proposal that introduces odp_tick_t which eliminates >> the wasted u64 reserved field in the union above. > > > Nothing is wasted here today. We need to support Posix timespec anyway. Timespec is used when there's no x86 invariant TSC available - so all non-x86 and even on some x86 systems. > > Also "count" here starts from zero (we save cpu counter value at start). So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that starts from zero) is not good for calendar time ? > > > > >> > +/* >> > + * Posix timespec based functions >> > + */ >> > >> > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) >> > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) >> >> Static functions in .c files will always be considered for inlining, so >> the >> inline qualifier is unnecessary here. You can use always_inline compiler >> attribute if the disassembly and run time performance is not as expected. >> >> Comment applies to nearly every function in this file. > > > This is because all our code (in this and other files) does "static inline" for functions that we want to be inlined. Static is used for functions that are static, but we don't care if those are inlined (slow path). Most time API functions are fast path. > > >> > >> > -static inline uint64_t time_local_res(void) >> > +static inline uint64_t time_hw_res(void) >> >> I think 'freq' is a more descriptive name than 'res'. >> >> It would also be good to document at either the function declaration >> or the function definition (if multiple implementations differ) whether >> the frequency is in Hz or kHz. For example, on ARMv8 it is just a >> register read to get Hz. No estimation, and no lowering to kHz. > > > This implements resolution API functions ... > > /** > * Global time resolution in hertz > * > * @return Global time resolution in hertz > */ > uint64_t odp_time_global_res(void); > > > ... so, from that context you know that it's Hz. In this implementation, I lower the measured Hz to be conservative. > > >> > + >> > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) >> > +{ >> > + odp_time_t time; >> > + >> > + time.hw.res = 0; >> > + time.hw.count = t1.hw.count + t2.hw.count; >> > + >> > + return time; >> > +} >> >> If a single u64 were used this code wouldn't need to exist. > > Which code? hw.res = 0 ? It not actually used for anything and could be removed. Although, the performance gain would not be huge. Anyway, I'll remove it for v2. > > -Petri >
On 21 April 2017 at 08:11, Petri Savolainen <petri.savolainen@linaro.org> wrote: > Use 64 bit HW time counter when available. It is used on > x86 when invariant TSC CPU flag indicates that TSC frequency > is constant. Otherwise, the system time is used as before. Direct > HW time counter usage avoids system call, and related latency > and performance issues. > > Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org> > --- > platform/linux-generic/Makefile.am | 1 + > platform/linux-generic/arch/arm/odp_cpu_arch.c | 11 + > platform/linux-generic/arch/default/odp_cpu_arch.c | 11 + > platform/linux-generic/arch/mips64/odp_cpu_arch.c | 11 + > platform/linux-generic/arch/powerpc/odp_cpu_arch.c | 11 + > platform/linux-generic/arch/x86/cpu_flags.c | 9 + > platform/linux-generic/arch/x86/odp_cpu_arch.c | 59 ++++ > .../include/odp/api/plat/time_types.h | 23 +- > platform/linux-generic/include/odp_time_internal.h | 24 ++ > platform/linux-generic/odp_time.c | 303 ++++++++++++++++----- > 10 files changed, 398 insertions(+), 65 deletions(-) > create mode 100644 platform/linux-generic/include/odp_time_internal.h > > diff --git a/platform/linux-generic/Makefile.am b/platform/linux-generic/Makefile.am > index 60b7f849..ed66fecf 100644 > --- a/platform/linux-generic/Makefile.am > +++ b/platform/linux-generic/Makefile.am > @@ -171,6 +171,7 @@ noinst_HEADERS = \ > ${srcdir}/include/odp_schedule_if.h \ > ${srcdir}/include/odp_sorted_list_internal.h \ > ${srcdir}/include/odp_shm_internal.h \ > + ${srcdir}/include/odp_time_internal.h \ > ${srcdir}/include/odp_timer_internal.h \ > ${srcdir}/include/odp_timer_wheel_internal.h \ > ${srcdir}/include/odp_traffic_mngr_internal.h \ > diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c b/platform/linux-generic/arch/arm/odp_cpu_arch.c > index 2ac223e0..3a87f09c 100644 > --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c > @@ -13,6 +13,7 @@ > #include <odp/api/hints.h> > #include <odp/api/system_info.h> > #include <odp_debug_internal.h> > +#include <odp_time_internal.h> > > #define GIGA 1000000000 > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return 0; > +} > + > +uint64_t cpu_global_time_freq(void) > +{ > + return 0; > +} > diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c b/platform/linux-generic/arch/default/odp_cpu_arch.c > index 2ac223e0..3a87f09c 100644 > --- a/platform/linux-generic/arch/default/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c > @@ -13,6 +13,7 @@ > #include <odp/api/hints.h> > #include <odp/api/system_info.h> > #include <odp_debug_internal.h> > +#include <odp_time_internal.h> > > #define GIGA 1000000000 > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return 0; > +} > + > +uint64_t cpu_global_time_freq(void) > +{ > + return 0; > +} > diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c b/platform/linux-generic/arch/mips64/odp_cpu_arch.c > index 646acf9c..a9a94531 100644 > --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c > @@ -7,6 +7,7 @@ > #include <odp/api/cpu.h> > #include <odp/api/hints.h> > #include <odp/api/system_info.h> > +#include <odp_time_internal.h> > > uint64_t odp_cpu_cycles(void) > { > @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return 0; > +} > + > +uint64_t cpu_global_time_freq(void) > +{ > + return 0; > +} > diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > index 2ac223e0..3a87f09c 100644 > --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c > @@ -13,6 +13,7 @@ > #include <odp/api/hints.h> > #include <odp/api/system_info.h> > #include <odp_debug_internal.h> > +#include <odp_time_internal.h> > > #define GIGA 1000000000 > > @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return 0; > +} > + > +uint64_t cpu_global_time_freq(void) > +{ > + return 0; > +} > diff --git a/platform/linux-generic/arch/x86/cpu_flags.c b/platform/linux-generic/arch/x86/cpu_flags.c > index 8fb9477a..cde8ad3e 100644 > --- a/platform/linux-generic/arch/x86/cpu_flags.c > +++ b/platform/linux-generic/arch/x86/cpu_flags.c > @@ -38,6 +38,7 @@ > */ > > #include <arch/x86/cpu_flags.h> > +#include <odp_time_internal.h> > #include <stdio.h> > #include <stdint.h> > > @@ -347,3 +348,11 @@ void cpu_flags_print_all(void) > > printf("\n\n"); > } > + > +int cpu_has_global_time(void) > +{ > + if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0) > + return 1; > + > + return 0; > +} > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c b/platform/linux-generic/arch/x86/odp_cpu_arch.c > index c8cf27b6..9ba601a3 100644 > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c > @@ -3,7 +3,14 @@ > * > * SPDX-License-Identifier: BSD-3-Clause > */ > + > +#include <odp_posix_extensions.h> > + > #include <odp/api/cpu.h> > +#include <odp_time_internal.h> > +#include <odp_debug_internal.h> > + > +#include <time.h> > > uint64_t odp_cpu_cycles(void) > { > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) > { > return 1; > } > + > +uint64_t cpu_global_time(void) > +{ > + return odp_cpu_cycles(); > +} > + > +#define SEC_IN_NS 1000000000ULL > + > +/* Measure TSC frequency. Frequency information registers are defined for x86, > + * but those are often not enumerated. */ > +uint64_t cpu_global_time_freq(void) > +{ > + struct timespec sleep, ts1, ts2; > + uint64_t t1, t2, ts_nsec, cycles, hz; > + int i; > + uint64_t avg = 0; > + int rounds = 4; > + > + for (i = 0; i < rounds; i++) { > + sleep.tv_sec = 0; > + sleep.tv_nsec = SEC_IN_NS / 10; > + > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) { > + ODP_DBG("clock_gettime failed\n"); > + return 0; > + } > + > + t1 = cpu_global_time(); > + > + if (nanosleep(&sleep, NULL) < 0) { > + ODP_DBG("nanosleep failed\n"); > + return 0; > + } > + > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) { > + ODP_DBG("clock_gettime failed\n"); > + return 0; > + } > + > + t2 = cpu_global_time(); > + > + ts_nsec = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS; > + ts_nsec += ts2.tv_nsec - ts1.tv_nsec; > + > + cycles = t2 - t1; > + > + hz = (cycles * SEC_IN_NS) / ts_nsec; > + avg += hz; > + } > + > + return avg / rounds; > +} This function is not very accurate. Ideally, ts1 and t1 (ts2 and t2) should be read at the same instance for this to be accurate. There is also the possibility that the 'rdtsc' in cpu_global_time might get executed ahead or later than what is in the code here. Since this is called during init, the 'rounds' can be increased to a higher value to get a better average. Initial values can be discarded to ignore the cache warming latencies. We should fall back to this only if the frequency information registers are not available. There is a good white paper on methods to use for measuring cycles for code that takes small amount of cycles. http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h b/platform/linux-generic/include/odp/api/plat/time_types.h > index 4847f3b1..8ae7ce82 100644 > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > @@ -26,11 +26,28 @@ extern "C" { > * the linux timespec structure, which is dependent on POSIX extension level. > */ > typedef struct odp_time_t { > - int64_t tv_sec; /**< @internal Seconds */ > - int64_t tv_nsec; /**< @internal Nanoseconds */ > + union { > + /** @internal Posix timespec */ > + struct { > + /** @internal Seconds */ > + int64_t tv_sec; > + > + /** @internal Nanoseconds */ > + int64_t tv_nsec; > + } spec; > + > + /** @internal HW time counter */ > + struct { > + /** @internal Counter value */ > + uint64_t count; > + > + /** @internal Reserved */ > + uint64_t res; > + } hw; > + }; > } odp_time_t; > > -#define ODP_TIME_NULL ((odp_time_t){0, 0}) > +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} }) > > /** > * @} > diff --git a/platform/linux-generic/include/odp_time_internal.h b/platform/linux-generic/include/odp_time_internal.h > new file mode 100644 > index 00000000..99ac7977 > --- /dev/null > +++ b/platform/linux-generic/include/odp_time_internal.h > @@ -0,0 +1,24 @@ > +/* Copyright (c) 2017, Linaro Limited > + * All rights reserved. > + * > + * SPDX-License-Identifier: BSD-3-Clause > + */ > + > +#ifndef ODP_TIME_INTERNAL_H_ > +#define ODP_TIME_INTERNAL_H_ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include <stdint.h> > + > +int cpu_has_global_time(void); > +uint64_t cpu_global_time(void); > +uint64_t cpu_global_time_freq(void); > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif > diff --git a/platform/linux-generic/odp_time.c b/platform/linux-generic/odp_time.c > index 81e05224..2dd8f2c4 100644 > --- a/platform/linux-generic/odp_time.c > +++ b/platform/linux-generic/odp_time.c > @@ -10,36 +10,39 @@ > #include <odp/api/time.h> > #include <odp/api/hints.h> > #include <odp_debug_internal.h> > +#include <odp_time_internal.h> > +#include <string.h> > +#include <inttypes.h> > > -static odp_time_t start_time; > +typedef struct time_global_t { > + odp_time_t start_time; > + int use_hw; > + uint64_t hw_start; > + uint64_t hw_freq_hz; > +} time_global_t; > > -static inline > -uint64_t time_to_ns(odp_time_t time) > -{ > - uint64_t ns; > - > - ns = time.tv_sec * ODP_TIME_SEC_IN_NS; > - ns += time.tv_nsec; > +static time_global_t global; > > - return ns; > -} > +/* > + * Posix timespec based functions > + */ > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) > { > odp_time_t time; > > - time.tv_sec = t2.tv_sec - t1.tv_sec; > - time.tv_nsec = t2.tv_nsec - t1.tv_nsec; > + time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec; > + time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec; > > - if (time.tv_nsec < 0) { > - time.tv_nsec += ODP_TIME_SEC_IN_NS; > - --time.tv_sec; > + if (time.spec.tv_nsec < 0) { > + time.spec.tv_nsec += ODP_TIME_SEC_IN_NS; > + --time.spec.tv_sec; > } > > return time; > } > > -static inline odp_time_t time_local(void) > +static inline odp_time_t time_spec_cur(void) > { > int ret; > odp_time_t time; > @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void) > if (odp_unlikely(ret != 0)) > ODP_ABORT("clock_gettime failed\n"); > > - time.tv_sec = sys_time.tv_sec; > - time.tv_nsec = sys_time.tv_nsec; > + time.spec.tv_sec = sys_time.tv_sec; > + time.spec.tv_nsec = sys_time.tv_nsec; > > - return time_diff(time, start_time); > + return time_spec_diff(time, global.start_time); > } > > -static inline int time_cmp(odp_time_t t2, odp_time_t t1) > +static inline uint64_t time_spec_res(void) > { > - if (t2.tv_sec < t1.tv_sec) > + int ret; > + struct timespec tres; > + > + ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); > + if (odp_unlikely(ret != 0)) > + ODP_ABORT("clock_getres failed\n"); > + > + return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; > +} > + > +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1) > +{ > + if (t2.spec.tv_sec < t1.spec.tv_sec) > return -1; > > - if (t2.tv_sec > t1.tv_sec) > + if (t2.spec.tv_sec > t1.spec.tv_sec) > return 1; > > - return t2.tv_nsec - t1.tv_nsec; > + return t2.spec.tv_nsec - t1.spec.tv_nsec; > } > > -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) > +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2) > { > odp_time_t time; > > - time.tv_sec = t2.tv_sec + t1.tv_sec; > - time.tv_nsec = t2.tv_nsec + t1.tv_nsec; > + time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec; > + time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec; > > - if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { > - time.tv_nsec -= ODP_TIME_SEC_IN_NS; > - ++time.tv_sec; > + if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { > + time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS; > + ++time.spec.tv_sec; > } > > return time; > } > > -static inline odp_time_t time_local_from_ns(uint64_t ns) > +static inline uint64_t time_spec_to_ns(odp_time_t time) > +{ > + uint64_t ns; > + > + ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS; > + ns += time.spec.tv_nsec; > + > + return ns; > +} > + > +static inline odp_time_t time_spec_from_ns(uint64_t ns) > { > odp_time_t time; > > - time.tv_sec = ns / ODP_TIME_SEC_IN_NS; > - time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS; > + time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS; > + time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS; > > return time; > } > > -static inline void time_wait_until(odp_time_t time) > +/* > + * HW time counter based functions > + */ > + > +static inline odp_time_t time_hw_cur(void) > { > - odp_time_t cur; > + odp_time_t time; > > - do { > - cur = time_local(); > - } while (time_cmp(time, cur) > 0); > + time.hw.res = 0; > + time.hw.count = cpu_global_time() - global.hw_start; > + > + return time; > } > > -static inline uint64_t time_local_res(void) > +static inline uint64_t time_hw_res(void) > { > - int ret; > - struct timespec tres; > + /* Promise a bit lower resolution than average cycle counter > + * frequency */ > + return global.hw_freq_hz / 10; > +} > > - ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); > - if (odp_unlikely(ret != 0)) > - ODP_ABORT("clock_getres failed\n"); > +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1) > +{ > + if (odp_likely(t2.hw.count > t1.hw.count)) > + return 1; > > - return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; > + if (t2.hw.count < t1.hw.count) > + return -1; > + > + return 0; > +} > + > +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1) > +{ > + odp_time_t time; > + > + time.hw.res = 0; > + time.hw.count = t2.hw.count - t1.hw.count; > + > + return time; > +} > + > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > +{ > + odp_time_t time; > + > + time.hw.res = 0; > + time.hw.count = t1.hw.count + t2.hw.count; > + > + return time; > +} > + > +static inline uint64_t time_hw_to_ns(odp_time_t time) > +{ > + uint64_t nsec; > + uint64_t freq_hz = global.hw_freq_hz; > + uint64_t count = time.hw.count; > + uint64_t sec = 0; > + > + if (count >= freq_hz) { > + sec = count / freq_hz; > + count = count - sec * freq_hz; > + } > + > + nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz; > + > + return (sec * ODP_TIME_SEC_IN_NS) + nsec; > +} > + > +static inline odp_time_t time_hw_from_ns(uint64_t ns) > +{ > + odp_time_t time; > + uint64_t count; > + uint64_t freq_hz = global.hw_freq_hz; > + uint64_t sec = 0; > + > + if (ns >= ODP_TIME_SEC_IN_NS) { > + sec = ns / ODP_TIME_SEC_IN_NS; > + ns = ns - sec * ODP_TIME_SEC_IN_NS; > + } > + > + count = sec * freq_hz; > + count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS; > + > + time.hw.res = 0; > + time.hw.count = count; > + > + return time; > +} > + > +/* > + * Common functions > + */ > + > +static inline odp_time_t time_cur(void) > +{ > + if (global.use_hw) > + return time_hw_cur(); > + > + return time_spec_cur(); > +} > + > +static inline uint64_t time_res(void) > +{ > + if (global.use_hw) > + return time_hw_res(); > + > + return time_spec_res(); > +} > + > +static inline int time_cmp(odp_time_t t2, odp_time_t t1) > +{ > + if (global.use_hw) > + return time_hw_cmp(t2, t1); > + > + return time_spec_cmp(t2, t1); > +} > + > +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) > +{ > + if (global.use_hw) > + return time_hw_diff(t2, t1); > + > + return time_spec_diff(t2, t1); > +} > + > +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) > +{ > + if (global.use_hw) > + return time_hw_sum(t1, t2); > + > + return time_spec_sum(t1, t2); > +} > + > +static inline uint64_t time_to_ns(odp_time_t time) > +{ > + if (global.use_hw) > + return time_hw_to_ns(time); > + > + return time_spec_to_ns(time); > +} > + > +static inline odp_time_t time_from_ns(uint64_t ns) > +{ > + if (global.use_hw) > + return time_hw_from_ns(ns); > + > + return time_spec_from_ns(ns); > +} > + > +static inline void time_wait_until(odp_time_t time) > +{ > + odp_time_t cur; > + > + do { > + cur = time_cur(); > + } while (time_cmp(time, cur) > 0); > } > > odp_time_t odp_time_local(void) > { > - return time_local(); > + return time_cur(); > } > > odp_time_t odp_time_global(void) > { > - return time_local(); > + return time_cur(); > } > > odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1) > @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time) > > odp_time_t odp_time_local_from_ns(uint64_t ns) > { > - return time_local_from_ns(ns); > + return time_from_ns(ns); > } > > odp_time_t odp_time_global_from_ns(uint64_t ns) > { > - return time_local_from_ns(ns); > + return time_from_ns(ns); > } > > int odp_time_cmp(odp_time_t t2, odp_time_t t1) > @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t t2) > > uint64_t odp_time_local_res(void) > { > - return time_local_res(); > + return time_res(); > } > > uint64_t odp_time_global_res(void) > { > - return time_local_res(); > + return time_res(); > } > > void odp_time_wait_ns(uint64_t ns) > { > - odp_time_t cur = time_local(); > - odp_time_t wait = time_local_from_ns(ns); > + odp_time_t cur = time_cur(); > + odp_time_t wait = time_from_ns(ns); > odp_time_t end_time = time_sum(cur, wait); > > time_wait_until(end_time); > @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time) > > int odp_time_init_global(void) > { > - int ret; > - struct timespec time; > - > - ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time); > - if (ret) { > - start_time = ODP_TIME_NULL; > - } else { > - start_time.tv_sec = time.tv_sec; > - start_time.tv_nsec = time.tv_nsec; > + struct timespec sys_time; > + int ret = 0; > + > + memset(&global, 0, sizeof(time_global_t)); > + > + if (cpu_has_global_time()) { > + global.use_hw = 1; > + global.hw_freq_hz = cpu_global_time_freq(); > + > + if (global.hw_freq_hz == 0) > + return -1; > + > + printf("HW time counter freq: %" PRIu64 " hz\n\n", > + global.hw_freq_hz); > + > + global.hw_start = cpu_global_time(); > + return 0; > + } > + > + global.start_time = ODP_TIME_NULL; > + > + ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time); > + if (ret == 0) { > + global.start_time.spec.tv_sec = sys_time.tv_sec; > + global.start_time.spec.tv_nsec = sys_time.tv_nsec; > } > > return ret; > -- > 2.11.0 >
> > From coverletter: > > "This patch set modifies time implementation to use TSC when running on > a x86 > > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system > time > > is used as before. TSC is much more efficient both in performance and > > latency/jitter wise than Linux system call. This can be seen also with > > scheduler latency test which time stamps events with this API. All > latency > > measurements (min, ave, max) improved significantly." > > odp_sched_latency currently uses clock_gettime. It is my understanding > that clock_gettime does not have the over head of the system call. Can > you elaborate more on the 'improved significantly' part? > clock_gettime() uses the same TSC, but when you profile it with perf you can see tens of kernel functions including system call entry, RCU maintenance, etc. E.g. in sched_latency test kernel consumed about 10% of all the cycles. Also latency measurement results improved like this: * min >3x lower * avg 2x lower * max more stable and 50% lower -Petri
> > > This function (cpu_global_time()) is called only when we have first > checked that TSC is invariant. Also we measure the TSC frequency in that > case. This function is defined in the same file as cpu_cycles(), and the > file is x86 specific. So, we know what we are doing, and just re-using the > code to read TSC. > > What sort of timing accuracy is expected from the app? > > From benchmarking the maximum single-threaded rate of these reads: > > x86_64: > > read 7 ns/op > read_sync 22 ns/op > > A57: > > read 4 ns/op > read_sync 26 ns/op > > read_sync issues a synchronizing instruction for greater timing accuracy > but clearly takes more time to return the time value read from the core. Accuracy is as good as implementation can offer with reasonable overhead. We do not put any nsec figures into API spec. ODP API should offer application the most efficient way to read time anyway. This patch does not take a position which way TSC should be read. There are three options: rdtsc, rdtsc + barrier, rdtscp. I think the current code is good enough for the accuracy. Barrier adds slight overhead. Rdtscp is not as widely supported as rdtsc. This detail is a magnitude less significant compared to: use system call vs direct TSC read. It can be tuned later. This patch set helps if rdtscp should be used later on (introduces x86 cpu flags). -Petri
> > > > diff --git a/platform/linux- > generic/include/odp/api/plat/time_types.h > > > b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > index 4847f3b1..8ae7ce82 100644 > > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > @@ -26,11 +26,28 @@ extern "C" { > > > > * the linux timespec structure, which is dependent on POSIX > extension > > > level. > > > > */ > > > > typedef struct odp_time_t { > > > > - int64_t tv_sec; /**< @internal Seconds */ > > > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > > > + union { > > > > + /** @internal Posix timespec */ > > > > + struct { > > > > + /** @internal Seconds */ > > > > + int64_t tv_sec; > > > > + > > > > + /** @internal Nanoseconds */ > > > > + int64_t tv_nsec; > > > > + } spec; > > > > + > > > > + /** @internal HW time counter */ > > > > + struct { > > > > + /** @internal Counter value */ > > > > + uint64_t count; > > > > + > > > > + /** @internal Reserved */ > > > > + uint64_t res; > > > > + } hw; > > > > + }; > > > > > > A processor's tick counter cannot be used to measure calendar time by > > > itself. The delta between two ticks, however, can be converted to > > > calendar time. > > > > > > Please see the proposal that introduces odp_tick_t which eliminates > > > the wasted u64 reserved field in the union above. > > > > > > Nothing is wasted here today. > > The 64-bit CPU time is stored in a 128-bit data structure where 64-bits > are wasted. You can use just a 64-bit type and then convert that to > a timespec (using a starting timestamp taken on global init) if needed. > Nothing is wasted compared to the situation today. Entire timespec is stored as before. If you want to compress timespec, it's another topic. Compress means additional complexity and overhead. This patch set just adds ability to use 64 but HW time when available. Timespec side implementation remains as is. > > This is because all our code (in this and other files) does "static > inline" for functions that we want to be inlined. Static is used for > functions that are static, but we don't care if those are inlined (slow > path). Most time API functions are fast path. > > I am saying that adding the 'inline' qualifier here doesn't actually do > anything > because the compiler will consider static functions for inlining anyways. We all know that those are considered. We use "inline" as C standard describes: to suggest which function calls should be as fast as possible. > > > > + > > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) > > > > +{ > > > > + odp_time_t time; > > > > + > > > > + time.hw.res = 0; > > > > + time.hw.count = t1.hw.count + t2.hw.count; > > > > + > > > > + return time; > > > > +} > > > > > > If a single u64 were used this code wouldn't need to exist. > > > > Which code? hw.res = 0 ? It not actually used for anything and could be > removed. Although, the performance gain would not be huge. Anyway, I'll > remove it for v2. > > If the CPU time was stored in a 64-bit type, you can use arithmetic > operators > on the values directly instead of doing arithmetic on a 128-bit compound > datatype where 64-bits are unused. This is the union above. The implementation above does 64bit arithmetic. The reserved field is not used for anything. V2 removed all .res references (except one due to a false GCC warning). We have abstract API definition, so that application remains portable also, when time cannot be represented with a single integer type. It's implementation defined, if e.g. the above sum function is inlined and thus results no overhead at all. -Petri
> > +#define SEC_IN_NS 1000000000ULL > > + > > +/* Measure TSC frequency. Frequency information registers are defined > for x86, > > + * but those are often not enumerated. */ > > +uint64_t cpu_global_time_freq(void) > > +{ > > + struct timespec sleep, ts1, ts2; > > + uint64_t t1, t2, ts_nsec, cycles, hz; > > + int i; > > + uint64_t avg = 0; > > + int rounds = 4; > > + > > + for (i = 0; i < rounds; i++) { > > + sleep.tv_sec = 0; > > + sleep.tv_nsec = SEC_IN_NS / 10; > > + > > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) { > > + ODP_DBG("clock_gettime failed\n"); > > + return 0; > > + } > > + > > + t1 = cpu_global_time(); > > + > > + if (nanosleep(&sleep, NULL) < 0) { > > + ODP_DBG("nanosleep failed\n"); > > + return 0; > > + } > > + > > + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) { > > + ODP_DBG("clock_gettime failed\n"); > > + return 0; > > + } > > + > > + t2 = cpu_global_time(); > > + > > + ts_nsec = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS; > > + ts_nsec += ts2.tv_nsec - ts1.tv_nsec; > > + > > + cycles = t2 - t1; > > + > > + hz = (cycles * SEC_IN_NS) / ts_nsec; > > + avg += hz; > > + } > > + > > + return avg / rounds; > > +} > > This function is not very accurate. Ideally, ts1 and t1 (ts2 and t2) > should be read at the same instance for this to be accurate. There is > also the possibility that the 'rdtsc' in cpu_global_time might get > executed ahead or later than what is in the code here. > > Since this is called during init, the 'rounds' can be increased to a > higher value to get a better average. Initial values can be discarded > to ignore the cache warming latencies. > > We should fall back to this only if the frequency information > registers are not available. > > There is a good white paper on methods to use for measuring cycles for > code that takes small amount of cycles. > http://www.intel.com/content/dam/www/public/us/en/documents/white- > papers/ia-32-ia-64-benchmark-code-execution-paper.pdf > Execution or latency does not affect accuracy here since we measure over 100ms (about 200 M CPU cycles). E.g. a time stamp error of 1000 cycles would result an error in the 6th digit of the result, which is not very significant. DPDK uses similar algorithm to measure the frequency. This measurement takes now 0.4 seconds in total. I would not want to consume too many seconds on every application start up for this. E.g. on my Haswell those registers are not populated. Also DPDK does not refer to those anywhere. So, don't expect those to be populated often. I think this should be accurate enough as is. It can be optimized later if needed. -Petri
> > odp_sched_latency currently uses clock_gettime. It is my understanding > > that clock_gettime does not have the over head of the system call. Can > > you elaborate more on the 'improved significantly' part? > > > > clock_gettime() uses the same TSC, but when you profile it with perf you can see tens of > kernel functions including system call entry, RCU maintenance, etc. clock_gettime() does not use the vdso implementation without syscall overhead on x86 if clock id is CLOCK_MONOTONIC_RAW as it seems to be in ODP. I think new enough kernels do support CLOCK_MONOTONIC_RAW in vsdo for arm64 though. CLOCK_MONOTONIC is supported in vdso in x86 and would not cause syscall overhead provided that the kernel time source is tsc (which it often is, but not always (e.g. in some VMs)). Janne
On 04/26/17 12:40, Peltonen, Janne (Nokia - FI/Espoo) wrote: >>> odp_sched_latency currently uses clock_gettime. It is my understanding >>> that clock_gettime does not have the over head of the system call. Can >>> you elaborate more on the 'improved significantly' part? >>> >> >> clock_gettime() uses the same TSC, but when you profile it with perf you can see tens of >> kernel functions including system call entry, RCU maintenance, etc. > > clock_gettime() does not use the vdso implementation without syscall overhead > on x86 if clock id is CLOCK_MONOTONIC_RAW as it seems to be in ODP. I think > new enough kernels do support CLOCK_MONOTONIC_RAW in vsdo for arm64 though. > > CLOCK_MONOTONIC is supported in vdso in x86 and would not cause syscall > overhead provided that the kernel time source is tsc (which it often is, > but not always (e.g. in some VMs)). > > Janne > > here we need to be very careful with 2 things: 1) if api says nanosecond should be returned than it has to be nanoseconds. 2) We need to go in more generic way. If on fresh kernels clock_gettime() shows great results then maybe it's reasonable to say with it. But in that case we need to do measurements and define minimal kernel version. I think that on any call kernel does some internal time store which might trigger other sybsystem to take an action (soft irq timers and rcu run as Petri saw). This thread looks like very old but it says that CLOCK_MONOTONIC_RAW has worst performance according to rdtsc: http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/ ps. link also has test code to measure values. Maxim.
> -----Original Message----- > From: Brian Brooks [mailto:brian.brooks@arm.com] > Sent: Wednesday, April 19, 2017 9:37 PM > To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com> > Cc: Honnappa Nagarahalli <honnappa.nagarahalli@linaro.org>; lng- > odp@lists.linaro.org > Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time > counter when available > > On 04/26 07:11:57, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > > > > > From coverletter: > > > > "This patch set modifies time implementation to use TSC when running > on > > > a x86 > > > > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux > system > > > time > > > > is used as before. TSC is much more efficient both in performance > and > > > > latency/jitter wise than Linux system call. This can be seen also > with > > > > scheduler latency test which time stamps events with this API. All > > > latency > > > > measurements (min, ave, max) improved significantly." > > > > > > odp_sched_latency currently uses clock_gettime. It is my understanding > > > that clock_gettime does not have the over head of the system call. Can > > > you elaborate more on the 'improved significantly' part? > > > > > > > clock_gettime() uses the same TSC, but when you profile it with perf you > can see tens of kernel functions including system call entry, RCU > maintenance, etc. > > > > E.g. in sched_latency test kernel consumed about 10% of all the cycles. > Also latency measurement results improved like this: > > * min >3x lower > > * avg 2x lower > > * max more stable and 50% lower > > You might want to share more information on the environment > where you're seeing such significant improvements because the > results on Broadwell do not match the above interpretation. > > PS - This patch series breaks the build on ARM. > Use v2. It should build on ARM. -Petri
> -----Original Message----- > From: Brian Brooks [mailto:brian.brooks@arm.com] > Sent: Wednesday, April 19, 2017 9:46 PM > To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com> > Cc: lng-odp@lists.linaro.org > Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time > counter when available > > On 04/26 07:30:15, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > > > > This function (cpu_global_time()) is called only when we have > first > > > checked that TSC is invariant. Also we measure the TSC frequency in > that > > > case. This function is defined in the same file as cpu_cycles(), and > the > > > file is x86 specific. So, we know what we are doing, and just re-using > the > > > code to read TSC. > > > > > > What sort of timing accuracy is expected from the app? > > > > > > From benchmarking the maximum single-threaded rate of these reads: > > > > > > x86_64: > > > > > > read 7 ns/op > > > read_sync 22 ns/op > > > > > > A57: > > > > > > read 4 ns/op > > > read_sync 26 ns/op > > > > > > read_sync issues a synchronizing instruction for greater timing > accuracy > > > but clearly takes more time to return the time value read from the > core. > > > > Accuracy is as good as implementation can offer with reasonable > overhead. We do not put any nsec figures into API spec. ODP API should > offer application the most efficient way to read time anyway. > > 'reasonable' is what we need to define. > > Another reason why you're seeing a performance boost on x86 is that when > switching from clock_gettime() to RDTSC, you're no longer issuing a > synchronizing > instruction (fence). As shown above, this can be a significant factor > depending > on how often the time is being sampled. > > However, there is a loss in timing accuracy because the load of the value > may not happen at the time it happens in program order. This is why a > synchronizing instruction needs to be used, but it slows down the > execution > of the thread on the core... > > > This patch does not take a position which way TSC should be read. There > are three options: rdtsc, rdtsc + barrier, rdtscp. I think the current > code is good enough for the accuracy. Barrier adds slight overhead. Rdtscp > is not as widely supported as rdtsc. This detail is a magnitude less > significant compared to: use system call vs direct TSC read. It can be > tuned later. This patch set helps if rdtscp should be used later on > (introduces x86 cpu flags). > > So you're saying that you do not need the synchronizing instruction, and > the > loss of timing accuracy is OK, right? What is our timing accuracy today? How much jitter the system call (and everything may launch) causes in the current implementation? How much we are losing accuracy compared what we have today? I'd say we are better off that today anyway, just because we avoid the system call. We don't have fence today on rdtsc read for cycle count. This patch does not add it, either. If we are good without it for cycles, we are good for nsec also. The scale time scale application typically measures is in order of micro to milliseconds - thousands to millions of nanoseconds. If out of order execution moves the sample position e.g. 20 nsec (40 CPU cycles), it is not significant error to the measurement. Already single cache miss during a measurement causes same kind of noise to the measurement. Also CPU crystal might not be too accurate either, etc. -Petri
> -----Original Message----- > From: Brian Brooks [mailto:brian.brooks@arm.com] > Sent: Wednesday, April 19, 2017 10:05 PM > To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia.com> > Cc: lng-odp@lists.linaro.org > Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time > counter when available > > On 04/26 08:02:09, Savolainen, Petri (Nokia - FI/Espoo) wrote: > > > > > > > > > > diff --git a/platform/linux- > > > generic/include/odp/api/plat/time_types.h > > > > > b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > > index 4847f3b1..8ae7ce82 100644 > > > > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h > > > > > > @@ -26,11 +26,28 @@ extern "C" { > > > > > > * the linux timespec structure, which is dependent on POSIX > > > extension > > > > > level. > > > > > > */ > > > > > > typedef struct odp_time_t { > > > > > > - int64_t tv_sec; /**< @internal Seconds */ > > > > > > - int64_t tv_nsec; /**< @internal Nanoseconds */ > > > > > > + union { > > > > > > + /** @internal Posix timespec */ > > > > > > + struct { > > > > > > + /** @internal Seconds */ > > > > > > + int64_t tv_sec; > > > > > > + > > > > > > + /** @internal Nanoseconds */ > > > > > > + int64_t tv_nsec; > > > > > > + } spec; > > > > > > + > > > > > > + /** @internal HW time counter */ > > > > > > + struct { > > > > > > + /** @internal Counter value */ > > > > > > + uint64_t count; > > > > > > + > > > > > > + /** @internal Reserved */ > > > > > > + uint64_t res; > > > > > > + } hw; > > > > > > + }; > > > > > > > > > > A processor's tick counter cannot be used to measure calendar time > by > > > > > itself. The delta between two ticks, however, can be converted to > > > > > calendar time. > > > > > > > > > > Please see the proposal that introduces odp_tick_t which > eliminates > > > > > the wasted u64 reserved field in the union above. > > > > > > > > > > > > Nothing is wasted here today. > > > > > > The 64-bit CPU time is stored in a 128-bit data structure where 64- > bits > > > are wasted. You can use just a 64-bit type and then convert that to > > > a timespec (using a starting timestamp taken on global init) if > needed. > > > > > > > Nothing is wasted compared to the situation today. Entire timespec is > stored as before. If you want to compress timespec, it's another topic. > Compress means additional complexity and overhead. This patch set just > adds ability to use 64 but HW time when available. Timespec side > implementation remains as is. > > You are re-using the calendar time type to store a value read from the CPU > counter. A different approach is to use a different type (64 bits only for > no > wasted space) for the value read from the CPU. > > This value must be converted to a calendar time time if needed. It cannot > just > be used to represent calendar time. > > However, for some applications and use cases, you don't need to convert to > calendar time > in order to measure across two reads of the counter. You can also do > direct arithmetic > on the value instead of arithmetic on a timespec. > > If you want to read code instead of email or design documentation, look at > this: > > > https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_ben > ch.c > https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time.h > > All you need is a 64 bit type, a way to read from the counter as well as > get > the frequency, and 2 very small conversion functions. First, odp_time_t is our storage type for time values. Today odp-linux defines that as 2x64bits. Nobody has complained about too high time storage consumption. I'm not changing size of the type here. It's not purpose of this patch set to e.g. compress timespec and thus minimize memory foot print of time storage. Second, are you proposing an API change? Change of time type definition? Ola's cntr_now() function changes as soon as your time source is not a 64bit counter, but e.g. timespec (2x64bit). It would need to compress that. How application would do arithmetic on the compressed timespec ? Third, compression/conversion usually mean division operation. Division is tens of times heavier operation than sum, dec, mult. Currently, ODP API is defined so that conversion is done only when application asks for it (odp_time_t <-> nsec time). Compression is a trade-off between storage space size and performance. -Petri
> You multiply seconds by 1,000,000,000 and add the nanoseconds to get the > epoch > offset in terms of nanoseconds - this can be stored in a uint64_t. And, > now you > can "spec" that time is in terms of nanoseconds not "implementation > defined". > The application can now do arithmetic too. > > > Third, compression/conversion usually mean division operation. Division > is tens of times heavier operation than sum, dec, mult. Currently, ODP API > is defined so that conversion is done only when application asks for it > (odp_time_t <-> nsec time). Compression is a trade-off between storage > space size and performance. > > I prefer to measure. Perhaps time operations using CPU tick masked by a > timespec > should be added here: > > > https://git.linaro.org/people/ola.liljedahl/sw_scheduler.git/tree/time_ben > ch.c > > Taking the difference between two 64-bit values and calling > nsec_from_cntr() is > quite fast - as fast (faster on ARM) as calling clock_gettime() once > through the vDSO. > Realized after sending yesterday that timespec is pretty fast to convert into nsec. So, I did add that as the last patch of v3. So, size of odp_time_t (in odp-linux) is 64 bits after this set. Hopefully, everybody are happy with this now. I'd really like to work one change at the time: first take HW time counter into use, then optimize timespec implementation, etc. One problem, one patch set. -Petri
diff --git a/platform/linux-generic/Makefile.am b/platform/linux-generic/Makefile.am index 60b7f849..ed66fecf 100644 --- a/platform/linux-generic/Makefile.am +++ b/platform/linux-generic/Makefile.am @@ -171,6 +171,7 @@ noinst_HEADERS = \ ${srcdir}/include/odp_schedule_if.h \ ${srcdir}/include/odp_sorted_list_internal.h \ ${srcdir}/include/odp_shm_internal.h \ + ${srcdir}/include/odp_time_internal.h \ ${srcdir}/include/odp_timer_internal.h \ ${srcdir}/include/odp_timer_wheel_internal.h \ ${srcdir}/include/odp_traffic_mngr_internal.h \ diff --git a/platform/linux-generic/arch/arm/odp_cpu_arch.c b/platform/linux-generic/arch/arm/odp_cpu_arch.c index 2ac223e0..3a87f09c 100644 --- a/platform/linux-generic/arch/arm/odp_cpu_arch.c +++ b/platform/linux-generic/arch/arm/odp_cpu_arch.c @@ -13,6 +13,7 @@ #include <odp/api/hints.h> #include <odp/api/system_info.h> #include <odp_debug_internal.h> +#include <odp_time_internal.h> #define GIGA 1000000000 @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) { return 1; } + +uint64_t cpu_global_time(void) +{ + return 0; +} + +uint64_t cpu_global_time_freq(void) +{ + return 0; +} diff --git a/platform/linux-generic/arch/default/odp_cpu_arch.c b/platform/linux-generic/arch/default/odp_cpu_arch.c index 2ac223e0..3a87f09c 100644 --- a/platform/linux-generic/arch/default/odp_cpu_arch.c +++ b/platform/linux-generic/arch/default/odp_cpu_arch.c @@ -13,6 +13,7 @@ #include <odp/api/hints.h> #include <odp/api/system_info.h> #include <odp_debug_internal.h> +#include <odp_time_internal.h> #define GIGA 1000000000 @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) { return 1; } + +uint64_t cpu_global_time(void) +{ + return 0; +} + +uint64_t cpu_global_time_freq(void) +{ + return 0; +} diff --git a/platform/linux-generic/arch/mips64/odp_cpu_arch.c b/platform/linux-generic/arch/mips64/odp_cpu_arch.c index 646acf9c..a9a94531 100644 --- a/platform/linux-generic/arch/mips64/odp_cpu_arch.c +++ b/platform/linux-generic/arch/mips64/odp_cpu_arch.c @@ -7,6 +7,7 @@ #include <odp/api/cpu.h> #include <odp/api/hints.h> #include <odp/api/system_info.h> +#include <odp_time_internal.h> uint64_t odp_cpu_cycles(void) { @@ -29,3 +30,13 @@ uint64_t odp_cpu_cycles_resolution(void) { return 1; } + +uint64_t cpu_global_time(void) +{ + return 0; +} + +uint64_t cpu_global_time_freq(void) +{ + return 0; +} diff --git a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c index 2ac223e0..3a87f09c 100644 --- a/platform/linux-generic/arch/powerpc/odp_cpu_arch.c +++ b/platform/linux-generic/arch/powerpc/odp_cpu_arch.c @@ -13,6 +13,7 @@ #include <odp/api/hints.h> #include <odp/api/system_info.h> #include <odp_debug_internal.h> +#include <odp_time_internal.h> #define GIGA 1000000000 @@ -46,3 +47,13 @@ uint64_t odp_cpu_cycles_resolution(void) { return 1; } + +uint64_t cpu_global_time(void) +{ + return 0; +} + +uint64_t cpu_global_time_freq(void) +{ + return 0; +} diff --git a/platform/linux-generic/arch/x86/cpu_flags.c b/platform/linux-generic/arch/x86/cpu_flags.c index 8fb9477a..cde8ad3e 100644 --- a/platform/linux-generic/arch/x86/cpu_flags.c +++ b/platform/linux-generic/arch/x86/cpu_flags.c @@ -38,6 +38,7 @@ */ #include <arch/x86/cpu_flags.h> +#include <odp_time_internal.h> #include <stdio.h> #include <stdint.h> @@ -347,3 +348,11 @@ void cpu_flags_print_all(void) printf("\n\n"); } + +int cpu_has_global_time(void) +{ + if (cpu_get_flag_enabled(RTE_CPUFLAG_INVTSC) > 0) + return 1; + + return 0; +} diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c b/platform/linux-generic/arch/x86/odp_cpu_arch.c index c8cf27b6..9ba601a3 100644 --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c @@ -3,7 +3,14 @@ * * SPDX-License-Identifier: BSD-3-Clause */ + +#include <odp_posix_extensions.h> + #include <odp/api/cpu.h> +#include <odp_time_internal.h> +#include <odp_debug_internal.h> + +#include <time.h> uint64_t odp_cpu_cycles(void) { @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void) { return 1; } + +uint64_t cpu_global_time(void) +{ + return odp_cpu_cycles(); +} + +#define SEC_IN_NS 1000000000ULL + +/* Measure TSC frequency. Frequency information registers are defined for x86, + * but those are often not enumerated. */ +uint64_t cpu_global_time_freq(void) +{ + struct timespec sleep, ts1, ts2; + uint64_t t1, t2, ts_nsec, cycles, hz; + int i; + uint64_t avg = 0; + int rounds = 4; + + for (i = 0; i < rounds; i++) { + sleep.tv_sec = 0; + sleep.tv_nsec = SEC_IN_NS / 10; + + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts1)) { + ODP_DBG("clock_gettime failed\n"); + return 0; + } + + t1 = cpu_global_time(); + + if (nanosleep(&sleep, NULL) < 0) { + ODP_DBG("nanosleep failed\n"); + return 0; + } + + if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts2)) { + ODP_DBG("clock_gettime failed\n"); + return 0; + } + + t2 = cpu_global_time(); + + ts_nsec = (ts2.tv_sec - ts1.tv_sec) * SEC_IN_NS; + ts_nsec += ts2.tv_nsec - ts1.tv_nsec; + + cycles = t2 - t1; + + hz = (cycles * SEC_IN_NS) / ts_nsec; + avg += hz; + } + + return avg / rounds; +} diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h b/platform/linux-generic/include/odp/api/plat/time_types.h index 4847f3b1..8ae7ce82 100644 --- a/platform/linux-generic/include/odp/api/plat/time_types.h +++ b/platform/linux-generic/include/odp/api/plat/time_types.h @@ -26,11 +26,28 @@ extern "C" { * the linux timespec structure, which is dependent on POSIX extension level. */ typedef struct odp_time_t { - int64_t tv_sec; /**< @internal Seconds */ - int64_t tv_nsec; /**< @internal Nanoseconds */ + union { + /** @internal Posix timespec */ + struct { + /** @internal Seconds */ + int64_t tv_sec; + + /** @internal Nanoseconds */ + int64_t tv_nsec; + } spec; + + /** @internal HW time counter */ + struct { + /** @internal Counter value */ + uint64_t count; + + /** @internal Reserved */ + uint64_t res; + } hw; + }; } odp_time_t; -#define ODP_TIME_NULL ((odp_time_t){0, 0}) +#define ODP_TIME_NULL ((odp_time_t){.spec = {0, 0} }) /** * @} diff --git a/platform/linux-generic/include/odp_time_internal.h b/platform/linux-generic/include/odp_time_internal.h new file mode 100644 index 00000000..99ac7977 --- /dev/null +++ b/platform/linux-generic/include/odp_time_internal.h @@ -0,0 +1,24 @@ +/* Copyright (c) 2017, Linaro Limited + * All rights reserved. + * + * SPDX-License-Identifier: BSD-3-Clause + */ + +#ifndef ODP_TIME_INTERNAL_H_ +#define ODP_TIME_INTERNAL_H_ + +#ifdef __cplusplus +extern "C" { +#endif + +#include <stdint.h> + +int cpu_has_global_time(void); +uint64_t cpu_global_time(void); +uint64_t cpu_global_time_freq(void); + +#ifdef __cplusplus +} +#endif + +#endif diff --git a/platform/linux-generic/odp_time.c b/platform/linux-generic/odp_time.c index 81e05224..2dd8f2c4 100644 --- a/platform/linux-generic/odp_time.c +++ b/platform/linux-generic/odp_time.c @@ -10,36 +10,39 @@ #include <odp/api/time.h> #include <odp/api/hints.h> #include <odp_debug_internal.h> +#include <odp_time_internal.h> +#include <string.h> +#include <inttypes.h> -static odp_time_t start_time; +typedef struct time_global_t { + odp_time_t start_time; + int use_hw; + uint64_t hw_start; + uint64_t hw_freq_hz; +} time_global_t; -static inline -uint64_t time_to_ns(odp_time_t time) -{ - uint64_t ns; - - ns = time.tv_sec * ODP_TIME_SEC_IN_NS; - ns += time.tv_nsec; +static time_global_t global; - return ns; -} +/* + * Posix timespec based functions + */ -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1) { odp_time_t time; - time.tv_sec = t2.tv_sec - t1.tv_sec; - time.tv_nsec = t2.tv_nsec - t1.tv_nsec; + time.spec.tv_sec = t2.spec.tv_sec - t1.spec.tv_sec; + time.spec.tv_nsec = t2.spec.tv_nsec - t1.spec.tv_nsec; - if (time.tv_nsec < 0) { - time.tv_nsec += ODP_TIME_SEC_IN_NS; - --time.tv_sec; + if (time.spec.tv_nsec < 0) { + time.spec.tv_nsec += ODP_TIME_SEC_IN_NS; + --time.spec.tv_sec; } return time; } -static inline odp_time_t time_local(void) +static inline odp_time_t time_spec_cur(void) { int ret; odp_time_t time; @@ -49,77 +52,237 @@ static inline odp_time_t time_local(void) if (odp_unlikely(ret != 0)) ODP_ABORT("clock_gettime failed\n"); - time.tv_sec = sys_time.tv_sec; - time.tv_nsec = sys_time.tv_nsec; + time.spec.tv_sec = sys_time.tv_sec; + time.spec.tv_nsec = sys_time.tv_nsec; - return time_diff(time, start_time); + return time_spec_diff(time, global.start_time); } -static inline int time_cmp(odp_time_t t2, odp_time_t t1) +static inline uint64_t time_spec_res(void) { - if (t2.tv_sec < t1.tv_sec) + int ret; + struct timespec tres; + + ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); + if (odp_unlikely(ret != 0)) + ODP_ABORT("clock_getres failed\n"); + + return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; +} + +static inline int time_spec_cmp(odp_time_t t2, odp_time_t t1) +{ + if (t2.spec.tv_sec < t1.spec.tv_sec) return -1; - if (t2.tv_sec > t1.tv_sec) + if (t2.spec.tv_sec > t1.spec.tv_sec) return 1; - return t2.tv_nsec - t1.tv_nsec; + return t2.spec.tv_nsec - t1.spec.tv_nsec; } -static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) +static inline odp_time_t time_spec_sum(odp_time_t t1, odp_time_t t2) { odp_time_t time; - time.tv_sec = t2.tv_sec + t1.tv_sec; - time.tv_nsec = t2.tv_nsec + t1.tv_nsec; + time.spec.tv_sec = t2.spec.tv_sec + t1.spec.tv_sec; + time.spec.tv_nsec = t2.spec.tv_nsec + t1.spec.tv_nsec; - if (time.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { - time.tv_nsec -= ODP_TIME_SEC_IN_NS; - ++time.tv_sec; + if (time.spec.tv_nsec >= (long)ODP_TIME_SEC_IN_NS) { + time.spec.tv_nsec -= ODP_TIME_SEC_IN_NS; + ++time.spec.tv_sec; } return time; } -static inline odp_time_t time_local_from_ns(uint64_t ns) +static inline uint64_t time_spec_to_ns(odp_time_t time) +{ + uint64_t ns; + + ns = time.spec.tv_sec * ODP_TIME_SEC_IN_NS; + ns += time.spec.tv_nsec; + + return ns; +} + +static inline odp_time_t time_spec_from_ns(uint64_t ns) { odp_time_t time; - time.tv_sec = ns / ODP_TIME_SEC_IN_NS; - time.tv_nsec = ns - time.tv_sec * ODP_TIME_SEC_IN_NS; + time.spec.tv_sec = ns / ODP_TIME_SEC_IN_NS; + time.spec.tv_nsec = ns - time.spec.tv_sec * ODP_TIME_SEC_IN_NS; return time; } -static inline void time_wait_until(odp_time_t time) +/* + * HW time counter based functions + */ + +static inline odp_time_t time_hw_cur(void) { - odp_time_t cur; + odp_time_t time; - do { - cur = time_local(); - } while (time_cmp(time, cur) > 0); + time.hw.res = 0; + time.hw.count = cpu_global_time() - global.hw_start; + + return time; } -static inline uint64_t time_local_res(void) +static inline uint64_t time_hw_res(void) { - int ret; - struct timespec tres; + /* Promise a bit lower resolution than average cycle counter + * frequency */ + return global.hw_freq_hz / 10; +} - ret = clock_getres(CLOCK_MONOTONIC_RAW, &tres); - if (odp_unlikely(ret != 0)) - ODP_ABORT("clock_getres failed\n"); +static inline int time_hw_cmp(odp_time_t t2, odp_time_t t1) +{ + if (odp_likely(t2.hw.count > t1.hw.count)) + return 1; - return ODP_TIME_SEC_IN_NS / (uint64_t)tres.tv_nsec; + if (t2.hw.count < t1.hw.count) + return -1; + + return 0; +} + +static inline odp_time_t time_hw_diff(odp_time_t t2, odp_time_t t1) +{ + odp_time_t time; + + time.hw.res = 0; + time.hw.count = t2.hw.count - t1.hw.count; + + return time; +} + +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2) +{ + odp_time_t time; + + time.hw.res = 0; + time.hw.count = t1.hw.count + t2.hw.count; + + return time; +} + +static inline uint64_t time_hw_to_ns(odp_time_t time) +{ + uint64_t nsec; + uint64_t freq_hz = global.hw_freq_hz; + uint64_t count = time.hw.count; + uint64_t sec = 0; + + if (count >= freq_hz) { + sec = count / freq_hz; + count = count - sec * freq_hz; + } + + nsec = (ODP_TIME_SEC_IN_NS * count) / freq_hz; + + return (sec * ODP_TIME_SEC_IN_NS) + nsec; +} + +static inline odp_time_t time_hw_from_ns(uint64_t ns) +{ + odp_time_t time; + uint64_t count; + uint64_t freq_hz = global.hw_freq_hz; + uint64_t sec = 0; + + if (ns >= ODP_TIME_SEC_IN_NS) { + sec = ns / ODP_TIME_SEC_IN_NS; + ns = ns - sec * ODP_TIME_SEC_IN_NS; + } + + count = sec * freq_hz; + count += (ns * freq_hz) / ODP_TIME_SEC_IN_NS; + + time.hw.res = 0; + time.hw.count = count; + + return time; +} + +/* + * Common functions + */ + +static inline odp_time_t time_cur(void) +{ + if (global.use_hw) + return time_hw_cur(); + + return time_spec_cur(); +} + +static inline uint64_t time_res(void) +{ + if (global.use_hw) + return time_hw_res(); + + return time_spec_res(); +} + +static inline int time_cmp(odp_time_t t2, odp_time_t t1) +{ + if (global.use_hw) + return time_hw_cmp(t2, t1); + + return time_spec_cmp(t2, t1); +} + +static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1) +{ + if (global.use_hw) + return time_hw_diff(t2, t1); + + return time_spec_diff(t2, t1); +} + +static inline odp_time_t time_sum(odp_time_t t1, odp_time_t t2) +{ + if (global.use_hw) + return time_hw_sum(t1, t2); + + return time_spec_sum(t1, t2); +} + +static inline uint64_t time_to_ns(odp_time_t time) +{ + if (global.use_hw) + return time_hw_to_ns(time); + + return time_spec_to_ns(time); +} + +static inline odp_time_t time_from_ns(uint64_t ns) +{ + if (global.use_hw) + return time_hw_from_ns(ns); + + return time_spec_from_ns(ns); +} + +static inline void time_wait_until(odp_time_t time) +{ + odp_time_t cur; + + do { + cur = time_cur(); + } while (time_cmp(time, cur) > 0); } odp_time_t odp_time_local(void) { - return time_local(); + return time_cur(); } odp_time_t odp_time_global(void) { - return time_local(); + return time_cur(); } odp_time_t odp_time_diff(odp_time_t t2, odp_time_t t1) @@ -134,12 +297,12 @@ uint64_t odp_time_to_ns(odp_time_t time) odp_time_t odp_time_local_from_ns(uint64_t ns) { - return time_local_from_ns(ns); + return time_from_ns(ns); } odp_time_t odp_time_global_from_ns(uint64_t ns) { - return time_local_from_ns(ns); + return time_from_ns(ns); } int odp_time_cmp(odp_time_t t2, odp_time_t t1) @@ -154,18 +317,18 @@ odp_time_t odp_time_sum(odp_time_t t1, odp_time_t t2) uint64_t odp_time_local_res(void) { - return time_local_res(); + return time_res(); } uint64_t odp_time_global_res(void) { - return time_local_res(); + return time_res(); } void odp_time_wait_ns(uint64_t ns) { - odp_time_t cur = time_local(); - odp_time_t wait = time_local_from_ns(ns); + odp_time_t cur = time_cur(); + odp_time_t wait = time_from_ns(ns); odp_time_t end_time = time_sum(cur, wait); time_wait_until(end_time); @@ -193,15 +356,31 @@ uint64_t odp_time_to_u64(odp_time_t time) int odp_time_init_global(void) { - int ret; - struct timespec time; - - ret = clock_gettime(CLOCK_MONOTONIC_RAW, &time); - if (ret) { - start_time = ODP_TIME_NULL; - } else { - start_time.tv_sec = time.tv_sec; - start_time.tv_nsec = time.tv_nsec; + struct timespec sys_time; + int ret = 0; + + memset(&global, 0, sizeof(time_global_t)); + + if (cpu_has_global_time()) { + global.use_hw = 1; + global.hw_freq_hz = cpu_global_time_freq(); + + if (global.hw_freq_hz == 0) + return -1; + + printf("HW time counter freq: %" PRIu64 " hz\n\n", + global.hw_freq_hz); + + global.hw_start = cpu_global_time(); + return 0; + } + + global.start_time = ODP_TIME_NULL; + + ret = clock_gettime(CLOCK_MONOTONIC_RAW, &sys_time); + if (ret == 0) { + global.start_time.spec.tv_sec = sys_time.tv_sec; + global.start_time.spec.tv_nsec = sys_time.tv_nsec; } return ret;
Use 64 bit HW time counter when available. It is used on x86 when invariant TSC CPU flag indicates that TSC frequency is constant. Otherwise, the system time is used as before. Direct HW time counter usage avoids system call, and related latency and performance issues. Signed-off-by: Petri Savolainen <petri.savolainen@linaro.org> --- platform/linux-generic/Makefile.am | 1 + platform/linux-generic/arch/arm/odp_cpu_arch.c | 11 + platform/linux-generic/arch/default/odp_cpu_arch.c | 11 + platform/linux-generic/arch/mips64/odp_cpu_arch.c | 11 + platform/linux-generic/arch/powerpc/odp_cpu_arch.c | 11 + platform/linux-generic/arch/x86/cpu_flags.c | 9 + platform/linux-generic/arch/x86/odp_cpu_arch.c | 59 ++++ .../include/odp/api/plat/time_types.h | 23 +- platform/linux-generic/include/odp_time_internal.h | 24 ++ platform/linux-generic/odp_time.c | 303 ++++++++++++++++----- 10 files changed, 398 insertions(+), 65 deletions(-) create mode 100644 platform/linux-generic/include/odp_time_internal.h -- 2.11.0