Message ID | 20250528180100.172042-7-adhemerval.zanella@linaro.org |
---|---|
State | New |
Headers | show |
Series | Simplify and optimize modf/modff | expand |
On 5/28/25 7:59 AM, Adhemerval Zanella wrote: > The generic implementation is slight more optimized than the powerpc > one, where it has a more optimized inf/nan check (by not using FP > unit checks, along with branch prediction hints), and removed one > branch by issuing trunc instead of a combination of floor/ceil (which > also generated less code). > > On power10 with gcc 14.2.1: > > reciprocal-throughput master patch difference > workload-0_1 1.1351 0.9067 20.12% > workload-1_maxint 1.4230 0.9040 36.47% > workload-maxint_maxfloat 1.5038 0.9076 39.65% > workload-integral 1.1280 0.9111 19.23% > > latency master patch difference > workload-0_1 1.1440 2.7117 -137.03% > workload-1_maxint 4.0556 2.7070 33.25% > workload-maxint_maxfloat 3.2122 2.7164 15.43% > workload-integral 3.2381 2.7281 15.75% I like the idea of using an optimized generic routine over an arch specific routine, but I'm confused by your data. For the data above, it looks like the throughput is less with the patch (ie, smaller number) and the latency is longer with the patch (ie, bigger number). ...and not just on Power, but x86 as well. Is that just my confusion? If so, what are these data values actually specifying? Peter
On 28/05/25 17:31, Peter Bergner wrote: > On 5/28/25 7:59 AM, Adhemerval Zanella wrote: >> The generic implementation is slight more optimized than the powerpc >> one, where it has a more optimized inf/nan check (by not using FP >> unit checks, along with branch prediction hints), and removed one >> branch by issuing trunc instead of a combination of floor/ceil (which >> also generated less code). >> >> On power10 with gcc 14.2.1: >> >> reciprocal-throughput master patch difference >> workload-0_1 1.1351 0.9067 20.12% >> workload-1_maxint 1.4230 0.9040 36.47% >> workload-maxint_maxfloat 1.5038 0.9076 39.65% >> workload-integral 1.1280 0.9111 19.23% >> >> latency master patch difference >> workload-0_1 1.1440 2.7117 -137.03% >> workload-1_maxint 4.0556 2.7070 33.25% >> workload-maxint_maxfloat 3.2122 2.7164 15.43% >> workload-integral 3.2381 2.7281 15.75% > > I like the idea of using an optimized generic routine over an arch > specific routine, but I'm confused by your data. For the data above, > it looks like the throughput is less with the patch (ie, smaller number) > and the latency is longer with the patch (ie, bigger number). ...and not > just on Power, but x86 as well. Is that just my confusion? If so, what > are these data values actually specifying? You are right and I think I messed up the number with a wrong assumption about the reciprocal-throughput. I think the powerpc might the best implementation, the final copysign is indeed hurting performance even on aarch64. Let me recheck everything.
On 29/05/25 10:46, Adhemerval Zanella Netto wrote: > > > On 28/05/25 17:31, Peter Bergner wrote: >> On 5/28/25 7:59 AM, Adhemerval Zanella wrote: >>> The generic implementation is slight more optimized than the powerpc >>> one, where it has a more optimized inf/nan check (by not using FP >>> unit checks, along with branch prediction hints), and removed one >>> branch by issuing trunc instead of a combination of floor/ceil (which >>> also generated less code). >>> >>> On power10 with gcc 14.2.1: >>> >>> reciprocal-throughput master patch difference >>> workload-0_1 1.1351 0.9067 20.12% >>> workload-1_maxint 1.4230 0.9040 36.47% >>> workload-maxint_maxfloat 1.5038 0.9076 39.65% >>> workload-integral 1.1280 0.9111 19.23% >>> >>> latency master patch difference >>> workload-0_1 1.1440 2.7117 -137.03% >>> workload-1_maxint 4.0556 2.7070 33.25% >>> workload-maxint_maxfloat 3.2122 2.7164 15.43% >>> workload-integral 3.2381 2.7281 15.75% >> >> I like the idea of using an optimized generic routine over an arch >> specific routine, but I'm confused by your data. For the data above, >> it looks like the throughput is less with the patch (ie, smaller number) >> and the latency is longer with the patch (ie, bigger number). ...and not >> just on Power, but x86 as well. Is that just my confusion? If so, what >> are these data values actually specifying? > > You are right and I think I messed up the number with a wrong assumption > about the reciprocal-throughput. I think the powerpc might the best > implementation, the final copysign is indeed hurting performance even > on aarch64. Let me recheck everything. Sigh, I screw up my number and the new implementation is not really an improvement over current one. I might send a new version with the new benchtest and the powerpc removal (which is used solely for old powerpc chips and currently does not much).
diff --git a/sysdeps/powerpc/fpu/math-use-builtins-trunc.h b/sysdeps/powerpc/fpu/math-use-builtins-trunc.h index fd186d3510..3e6a55d2ca 100644 --- a/sysdeps/powerpc/fpu/math-use-builtins-trunc.h +++ b/sysdeps/powerpc/fpu/math-use-builtins-trunc.h @@ -1,8 +1,9 @@ -#define USE_TRUNC_BUILTIN 0 #ifdef _ARCH_PWR5X # define USE_TRUNCF_BUILTIN 1 +# define USE_TRUNC_BUILTIN 1 #else # define USE_TRUNCF_BUILTIN 0 +# define USE_TRUNC_BUILTIN 0 #endif #define USE_TRUNCL_BUILTIN 0 #define USE_TRUNCF128_BUILTIN 0 diff --git a/sysdeps/powerpc/fpu/s_modf.c b/sysdeps/powerpc/fpu/s_modf.c deleted file mode 100644 index 831072bdc3..0000000000 --- a/sysdeps/powerpc/fpu/s_modf.c +++ /dev/null @@ -1,59 +0,0 @@ -/* Copyright (C) 2013-2025 Free Software Foundation, Inc. - This file is part of the GNU C Library - - The GNU C Library is free software; you can redistribute it and/or - modify it under the terms of the GNU Library General Public License as - published by the Free Software Foundation; either version 2 of the - License, or (at your option) any later version. - - The GNU C Library is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - Library General Public License for more details. - - You should have received a copy of the GNU Library General Public - License along with the GNU C Library; see the file COPYING.LIB. If - not, see <https://www.gnu.org/licenses/>. */ - -/* ISA 2.07 provides fast GPR to FP instruction (mfvsr{d,wz}) which make - generic implementation faster. Also disables for old ISAs that do not - have ceil/floor instructions. */ -#if defined(_ARCH_PWR8) || !defined(_ARCH_PWR5X) -# include <sysdeps/ieee754/ldbl-opt/s_modf.c> -#else -# include <math.h> -# include <math_ldbl_opt.h> -# include <libm-alias-double.h> - -double -__modf (double x, double *iptr) -{ - if (__builtin_isinf (x)) - { - *iptr = x; - return copysign (0.0, x); - } - else if (__builtin_isnan (x)) - { - *iptr = NAN; - return NAN; - } - - if (x >= 0.0) - { - *iptr = floor (x); - return copysign (x - *iptr, x); - } - else - { - *iptr = ceil (x); - return copysign (x - *iptr, x); - } -} -# ifndef __modf -libm_alias_double (__modf, modf) -# if LONG_DOUBLE_COMPAT (libc, GLIBC_2_0) -compat_symbol (libc, __modf, modfl, GLIBC_2_0); -# endif -# endif -#endif diff --git a/sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_modf-power5+.c b/sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_modf-power5+.c index b8315c5b81..48f3a1997d 100644 --- a/sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_modf-power5+.c +++ b/sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_modf-power5+.c @@ -17,4 +17,4 @@ <https://www.gnu.org/licenses/>. */ #define __modf __modf_power5plus -#include <sysdeps/powerpc/fpu/s_modf.c> +#include <sysdeps/ieee754/dbl-64/s_modf.c> diff --git a/sysdeps/powerpc/powerpc64/be/fpu/multiarch/s_modf-power5+.c b/sysdeps/powerpc/powerpc64/be/fpu/multiarch/s_modf-power5+.c index b8315c5b81..48f3a1997d 100644 --- a/sysdeps/powerpc/powerpc64/be/fpu/multiarch/s_modf-power5+.c +++ b/sysdeps/powerpc/powerpc64/be/fpu/multiarch/s_modf-power5+.c @@ -17,4 +17,4 @@ <https://www.gnu.org/licenses/>. */ #define __modf __modf_power5plus -#include <sysdeps/powerpc/fpu/s_modf.c> +#include <sysdeps/ieee754/dbl-64/s_modf.c>