diff mbox series

[6/7] powerpc: Remove modf optimization

Message ID 20250528180100.172042-7-adhemerval.zanella@linaro.org
State New
Headers show
Series Simplify and optimize modf/modff | expand

Commit Message

Adhemerval Zanella Netto May 28, 2025, 5:59 p.m. UTC
The generic implementation is slight more optimized than the powerpc
one, where it has a more optimized inf/nan check (by not using FP
unit checks, along with branch prediction hints), and removed one
branch by issuing trunc instead of a combination of floor/ceil (which
also generated less code).

On power10 with gcc 14.2.1:

reciprocal-throughput        master         patch        difference
workload-0_1                 1.1351        0.9067            20.12%
workload-1_maxint            1.4230        0.9040            36.47%
workload-maxint_maxfloat     1.5038        0.9076            39.65%
workload-integral            1.1280        0.9111            19.23%

latency                      master         patch        difference
workload-0_1                 1.1440        2.7117          -137.03%
workload-1_maxint            4.0556        2.7070            33.25%
workload-maxint_maxfloat     3.2122        2.7164            15.43%
workload-integral            3.2381        2.7281            15.75%

Checked on powerpc64le-linux-gnu.
---
 sysdeps/powerpc/fpu/math-use-builtins-trunc.h |  3 +-
 sysdeps/powerpc/fpu/s_modf.c                  | 59 -------------------
 .../power4/fpu/multiarch/s_modf-power5+.c     |  2 +-
 .../be/fpu/multiarch/s_modf-power5+.c         |  2 +-
 4 files changed, 4 insertions(+), 62 deletions(-)
 delete mode 100644 sysdeps/powerpc/fpu/s_modf.c

Comments

Peter Bergner May 28, 2025, 8:31 p.m. UTC | #1
On 5/28/25 7:59 AM, Adhemerval Zanella wrote:
> The generic implementation is slight more optimized than the powerpc
> one, where it has a more optimized inf/nan check (by not using FP
> unit checks, along with branch prediction hints), and removed one
> branch by issuing trunc instead of a combination of floor/ceil (which
> also generated less code).
> 
> On power10 with gcc 14.2.1:
> 
> reciprocal-throughput        master         patch        difference
> workload-0_1                 1.1351        0.9067            20.12%
> workload-1_maxint            1.4230        0.9040            36.47%
> workload-maxint_maxfloat     1.5038        0.9076            39.65%
> workload-integral            1.1280        0.9111            19.23%
> 
> latency                      master         patch        difference
> workload-0_1                 1.1440        2.7117          -137.03%
> workload-1_maxint            4.0556        2.7070            33.25%
> workload-maxint_maxfloat     3.2122        2.7164            15.43%
> workload-integral            3.2381        2.7281            15.75%

I like the idea of using an optimized generic routine over an arch
specific routine, but I'm confused by your data.  For the data above,
it looks like the throughput is less with the patch (ie, smaller number)
and the latency is longer with the patch (ie, bigger number).  ...and not
just on Power, but x86 as well.  Is that just my confusion?  If so, what
are these data values actually specifying?

Peter
Adhemerval Zanella Netto May 29, 2025, 1:46 p.m. UTC | #2
On 28/05/25 17:31, Peter Bergner wrote:
> On 5/28/25 7:59 AM, Adhemerval Zanella wrote:
>> The generic implementation is slight more optimized than the powerpc
>> one, where it has a more optimized inf/nan check (by not using FP
>> unit checks, along with branch prediction hints), and removed one
>> branch by issuing trunc instead of a combination of floor/ceil (which
>> also generated less code).
>>
>> On power10 with gcc 14.2.1:
>>
>> reciprocal-throughput        master         patch        difference
>> workload-0_1                 1.1351        0.9067            20.12%
>> workload-1_maxint            1.4230        0.9040            36.47%
>> workload-maxint_maxfloat     1.5038        0.9076            39.65%
>> workload-integral            1.1280        0.9111            19.23%
>>
>> latency                      master         patch        difference
>> workload-0_1                 1.1440        2.7117          -137.03%
>> workload-1_maxint            4.0556        2.7070            33.25%
>> workload-maxint_maxfloat     3.2122        2.7164            15.43%
>> workload-integral            3.2381        2.7281            15.75%
> 
> I like the idea of using an optimized generic routine over an arch
> specific routine, but I'm confused by your data.  For the data above,
> it looks like the throughput is less with the patch (ie, smaller number)
> and the latency is longer with the patch (ie, bigger number).  ...and not
> just on Power, but x86 as well.  Is that just my confusion?  If so, what
> are these data values actually specifying?

You are right and I think I messed up the number with a wrong assumption
about the reciprocal-throughput. I think the powerpc might the best
implementation, the final copysign is indeed hurting performance even
on aarch64. Let me recheck everything.
Adhemerval Zanella Netto May 29, 2025, 2:02 p.m. UTC | #3
On 29/05/25 10:46, Adhemerval Zanella Netto wrote:
> 
> 
> On 28/05/25 17:31, Peter Bergner wrote:
>> On 5/28/25 7:59 AM, Adhemerval Zanella wrote:
>>> The generic implementation is slight more optimized than the powerpc
>>> one, where it has a more optimized inf/nan check (by not using FP
>>> unit checks, along with branch prediction hints), and removed one
>>> branch by issuing trunc instead of a combination of floor/ceil (which
>>> also generated less code).
>>>
>>> On power10 with gcc 14.2.1:
>>>
>>> reciprocal-throughput        master         patch        difference
>>> workload-0_1                 1.1351        0.9067            20.12%
>>> workload-1_maxint            1.4230        0.9040            36.47%
>>> workload-maxint_maxfloat     1.5038        0.9076            39.65%
>>> workload-integral            1.1280        0.9111            19.23%
>>>
>>> latency                      master         patch        difference
>>> workload-0_1                 1.1440        2.7117          -137.03%
>>> workload-1_maxint            4.0556        2.7070            33.25%
>>> workload-maxint_maxfloat     3.2122        2.7164            15.43%
>>> workload-integral            3.2381        2.7281            15.75%
>>
>> I like the idea of using an optimized generic routine over an arch
>> specific routine, but I'm confused by your data.  For the data above,
>> it looks like the throughput is less with the patch (ie, smaller number)
>> and the latency is longer with the patch (ie, bigger number).  ...and not
>> just on Power, but x86 as well.  Is that just my confusion?  If so, what
>> are these data values actually specifying?
> 
> You are right and I think I messed up the number with a wrong assumption
> about the reciprocal-throughput. I think the powerpc might the best
> implementation, the final copysign is indeed hurting performance even
> on aarch64. Let me recheck everything.

Sigh, I screw up my number and the new implementation is not really an
improvement over current one.  I might send a new version with the new
benchtest and the powerpc removal (which is used solely for old powerpc
chips and currently does not much).
diff mbox series

Patch

diff --git a/sysdeps/powerpc/fpu/math-use-builtins-trunc.h b/sysdeps/powerpc/fpu/math-use-builtins-trunc.h
index fd186d3510..3e6a55d2ca 100644
--- a/sysdeps/powerpc/fpu/math-use-builtins-trunc.h
+++ b/sysdeps/powerpc/fpu/math-use-builtins-trunc.h
@@ -1,8 +1,9 @@ 
-#define USE_TRUNC_BUILTIN 0
 #ifdef _ARCH_PWR5X
 # define USE_TRUNCF_BUILTIN 1
+# define USE_TRUNC_BUILTIN 1
 #else
 # define USE_TRUNCF_BUILTIN 0
+# define USE_TRUNC_BUILTIN 0
 #endif
 #define USE_TRUNCL_BUILTIN 0
 #define USE_TRUNCF128_BUILTIN 0
diff --git a/sysdeps/powerpc/fpu/s_modf.c b/sysdeps/powerpc/fpu/s_modf.c
deleted file mode 100644
index 831072bdc3..0000000000
--- a/sysdeps/powerpc/fpu/s_modf.c
+++ /dev/null
@@ -1,59 +0,0 @@ 
-/* Copyright (C) 2013-2025 Free Software Foundation, Inc.
-   This file is part of the GNU C Library
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Library General Public License as
-   published by the Free Software Foundation; either version 2 of the
-   License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Library General Public License for more details.
-
-   You should have received a copy of the GNU Library General Public
-   License along with the GNU C Library; see the file COPYING.LIB.  If
-   not, see <https://www.gnu.org/licenses/>.  */
-
-/* ISA 2.07 provides fast GPR to FP instruction (mfvsr{d,wz}) which make
-   generic implementation faster.  Also disables for old ISAs that do not
-   have ceil/floor instructions.  */
-#if defined(_ARCH_PWR8) || !defined(_ARCH_PWR5X)
-# include <sysdeps/ieee754/ldbl-opt/s_modf.c>
-#else
-# include <math.h>
-# include <math_ldbl_opt.h>
-# include <libm-alias-double.h>
-
-double
-__modf (double x, double *iptr)
-{
-  if (__builtin_isinf (x))
-    {
-      *iptr = x;
-      return copysign (0.0, x);
-    }
-  else if (__builtin_isnan (x))
-    {
-      *iptr = NAN;
-      return NAN;
-    }
-
-  if (x >= 0.0)
-    {
-      *iptr = floor (x);
-      return copysign (x - *iptr, x);
-    }
-  else
-    {
-      *iptr = ceil (x);
-      return copysign (x - *iptr, x);
-    }
-}
-# ifndef __modf
-libm_alias_double (__modf, modf)
-#  if LONG_DOUBLE_COMPAT (libc, GLIBC_2_0)
-compat_symbol (libc, __modf, modfl, GLIBC_2_0);
-#  endif
-# endif
-#endif
diff --git a/sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_modf-power5+.c b/sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_modf-power5+.c
index b8315c5b81..48f3a1997d 100644
--- a/sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_modf-power5+.c
+++ b/sysdeps/powerpc/powerpc32/power4/fpu/multiarch/s_modf-power5+.c
@@ -17,4 +17,4 @@ 
    <https://www.gnu.org/licenses/>.  */
 
 #define __modf __modf_power5plus
-#include <sysdeps/powerpc/fpu/s_modf.c>
+#include <sysdeps/ieee754/dbl-64/s_modf.c>
diff --git a/sysdeps/powerpc/powerpc64/be/fpu/multiarch/s_modf-power5+.c b/sysdeps/powerpc/powerpc64/be/fpu/multiarch/s_modf-power5+.c
index b8315c5b81..48f3a1997d 100644
--- a/sysdeps/powerpc/powerpc64/be/fpu/multiarch/s_modf-power5+.c
+++ b/sysdeps/powerpc/powerpc64/be/fpu/multiarch/s_modf-power5+.c
@@ -17,4 +17,4 @@ 
    <https://www.gnu.org/licenses/>.  */
 
 #define __modf __modf_power5plus
-#include <sysdeps/powerpc/fpu/s_modf.c>
+#include <sysdeps/ieee754/dbl-64/s_modf.c>