Message ID: 20230109135828.879136-1-mark.rutland@arm.com
Series: arm64/ftrace: Add support for DYNAMIC_FTRACE_WITH_CALL_OPS
On Mon, Jan 9, 2023 at 6:06 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Sorry, that is something I had intended to do but I hadn't extracted a
> reproducer yet. I'll try to come up with something that can be included in the
> commit message and reported to GCC folk (and double-check at the same time that
> there's not another hidden cause)

Yeah, no worries :) I suggested it because from my quick test it didn't appear
to be reproducible trivially, so I thought having the reproducer would be nice.

> I'm happy to move these, I just wasn't sure what the policy would be w.r.t. the
> existing __weak and __cold definitions since those end up depending upon
> __function_aligned.
>
> I assume I should move them all? i.e. move __weak as well?

Yeah, with the current policy, all should be moved since their behavior now
depends on the config (eventually).

Cheers,
Miguel
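As a rough sketch of what the config-dependent definitions discussed above
could look like (the macro bodies here are assumptions for illustration, not
the series' final code):

/*
 * Hedged sketch only: CONFIG_FUNCTION_ALIGNMENT and __function_aligned are
 * assumed to behave roughly as discussed above; the real definitions in the
 * series may differ.
 */
#if defined(CONFIG_FUNCTION_ALIGNMENT) && CONFIG_FUNCTION_ALIGNMENT > 0
#define __function_aligned	__attribute__((aligned(CONFIG_FUNCTION_ALIGNMENT)))
#else
#define __function_aligned
#endif

/*
 * Because GCC drops the -falign-functions alignment for cold functions, a
 * config-dependent __cold would either need to carry the alignment
 * explicitly or drop the cold attribute when alignment matters.
 */
#if defined(CONFIG_FUNCTION_ALIGNMENT) && CONFIG_FUNCTION_ALIGNMENT > 0
#define __cold			__function_aligned
#else
#define __cold			__attribute__((cold))
#endif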
From: Mark Rutland
> Sent: 09 January 2023 13:58
>
> This series adds a new DYNAMIC_FTRACE_WITH_CALL_OPS mechanism, and
> enables support for this on arm64. This significantly reduces the
> overhead of tracing when a callsite/tracee has a single associated
> tracer, avoids a number of issues that make it undesirable and
> infeasible to use dynamically-allocated trampolines (e.g. branch range
> limitations), and makes it possible to implement support for
> DYNAMIC_FTRACE_WITH_DIRECT_CALLS in future.
>
> The main idea is to give each ftrace callsite an associated pointer to
> an ftrace_ops. The architecture's ftrace_caller trampoline can recover
> the ops pointer and invoke ops->func from this without needing to use
> ftrace_ops_list_func, which has to iterate through all registered ops.
>
> To do this, we use -fpatchable-function-entry=M,N, where N NOPs are
> placed before the function entry point...

Doesn't this bump the minimum gcc version up to something like 9.0?

How does it interact with the 'CFI stuff' that also uses the same area?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue, Jan 10, 2023 at 08:55:58AM +0000, David Laight wrote:
> From: Mark Rutland
> > Sent: 09 January 2023 13:58
> >
> > This series adds a new DYNAMIC_FTRACE_WITH_CALL_OPS mechanism, and
> > enables support for this on arm64. This significantly reduces the
> > overhead of tracing when a callsite/tracee has a single associated
> > tracer, avoids a number of issues that make it undesirable and
> > infeasible to use dynamically-allocated trampolines (e.g. branch range
> > limitations), and makes it possible to implement support for
> > DYNAMIC_FTRACE_WITH_DIRECT_CALLS in future.
> >
> > The main idea is to give each ftrace callsite an associated pointer to
> > an ftrace_ops. The architecture's ftrace_caller trampoline can recover
> > the ops pointer and invoke ops->func from this without needing to use
> > ftrace_ops_list_func, which has to iterate through all registered ops.
> >
> > To do this, we use -fpatchable-function-entry=M,N, where N NOPs are
> > placed before the function entry point...
>
> Doesn't this bump the minimum gcc version up to something like 9.0?

This doesn't bump the minimum GCC version, but users of older toolchains won't
get the speedup.

We already support -fpatchable-function-entry based ftrace with GCC 8+ (and
this is necessary to play nicely with pointer authentication); for older GCC
versions we still support using -pg / mcount.

> How does it interact with the 'CFI stuff' that also uses the same area?

There's some more detail in patch 8, but the summary is that they're mutually
exclusive for now (enforced by Kconfig), and I'm working with others to get
the improved compiler support necessary for them to play nicely together.

Currently LLVM will place the type-hash before the pre-function NOPs, which
works if everything has pre-function NOPs, but doesn't work for calls between
instrumented and non-instrumented functions, since the latter don't have
pre-function NOPs and the type hash ends up at a different offset. Making
them work well together will need that improved compiler support.

GCC doesn't currently have KCFI support, but the plan is to match whatever
LLVM does. Atop that we'll need some trivial changes to the asm function
macros, but without the underlying compiler support there's not much point.

Thanks,
Mark.
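As a rough illustration of what -fpatchable-function-entry does (a sketch
with example values, not code from the series):

/*
 * example.c - illustrative only. With GCC 8+ or Clang:
 *
 *   cc -O2 -fpatchable-function-entry=4,2 -S example.c
 *
 * the compiler emits 4 NOPs per function in total: 2 *before* the entry
 * point of traceable() and 2 at the entry point itself. The pre-entry
 * space is what the series uses to hold a pointer to the function's
 * ftrace_ops, so the trampoline can load that pointer relative to the
 * patched call site instead of walking the global ops list.
 */
void traceable(void)
{
	/* The body is irrelevant; only the patchable entry matters here. */
}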
On Mon, Jan 09, 2023 at 01:58:23PM +0000, Mark Rutland wrote:

> diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> index 1436fa1cde24d..df18a3446ce82 100644
> --- a/arch/arm64/include/asm/linkage.h
> +++ b/arch/arm64/include/asm/linkage.h
> @@ -5,8 +5,14 @@
>  #include <asm/assembler.h>
>  #endif
>
> -#define __ALIGN		.align 2
> -#define __ALIGN_STR	".align 2"
> +#if CONFIG_FUNCTION_ALIGNMENT > 0
> +#define ARM64_FUNCTION_ALIGNMENT CONFIG_FUNCTION_ALIGNMENT
> +#else
> +#define ARM64_FUNCTION_ALIGNMENT 4
> +#endif
> +
> +#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
> +#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT

Isn't that much the same as having ARM64 select FUNCTION_ALIGNMENT_4B and
simply removing all these lines and relying on the default behaviour?
On Tue, Jan 10, 2023 at 09:35:18PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 09, 2023 at 01:58:23PM +0000, Mark Rutland wrote:
>
> > diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> > index 1436fa1cde24d..df18a3446ce82 100644
> > --- a/arch/arm64/include/asm/linkage.h
> > +++ b/arch/arm64/include/asm/linkage.h
> > @@ -5,8 +5,14 @@
> >  #include <asm/assembler.h>
> >  #endif
> >
> > -#define __ALIGN		.align 2
> > -#define __ALIGN_STR	".align 2"
> > +#if CONFIG_FUNCTION_ALIGNMENT > 0
> > +#define ARM64_FUNCTION_ALIGNMENT CONFIG_FUNCTION_ALIGNMENT
> > +#else
> > +#define ARM64_FUNCTION_ALIGNMENT 4
> > +#endif
> > +
> > +#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
> > +#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT
>
> Isn't that much the same as having ARM64 select FUNCTION_ALIGNMENT_4B
> and simply removing all these lines and relying on the default
> behaviour?

There's a proposal (with some rough performance claims) to select
FUNCTION_ALIGNMENT_16B over at:

  https://lore.kernel.org/r/20221208053649.540891-1-almasrymina@google.com

so we could just go with that?

Will
On Tue, Jan 10, 2023 at 09:35:18PM +0100, Peter Zijlstra wrote:
> On Mon, Jan 09, 2023 at 01:58:23PM +0000, Mark Rutland wrote:
>
> > diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> > index 1436fa1cde24d..df18a3446ce82 100644
> > --- a/arch/arm64/include/asm/linkage.h
> > +++ b/arch/arm64/include/asm/linkage.h
> > @@ -5,8 +5,14 @@
> >  #include <asm/assembler.h>
> >  #endif
> >
> > -#define __ALIGN		.align 2
> > -#define __ALIGN_STR	".align 2"
> > +#if CONFIG_FUNCTION_ALIGNMENT > 0
> > +#define ARM64_FUNCTION_ALIGNMENT CONFIG_FUNCTION_ALIGNMENT
> > +#else
> > +#define ARM64_FUNCTION_ALIGNMENT 4
> > +#endif
> > +
> > +#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
> > +#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT
>
> Isn't that much the same as having ARM64 select FUNCTION_ALIGNMENT_4B
> and simply removing all these lines and relying on the default
> behaviour?

Yes, it is. I was focussed on obviously retaining the existing semantic by
default, and I missed that was possible by selecting FUNCTION_ALIGNMENT_4B.

Thanks,
Mark.
On Tue, Jan 10, 2023 at 08:43:20PM +0000, Will Deacon wrote:
> On Tue, Jan 10, 2023 at 09:35:18PM +0100, Peter Zijlstra wrote:
> > On Mon, Jan 09, 2023 at 01:58:23PM +0000, Mark Rutland wrote:
> >
> > > diff --git a/arch/arm64/include/asm/linkage.h b/arch/arm64/include/asm/linkage.h
> > > index 1436fa1cde24d..df18a3446ce82 100644
> > > --- a/arch/arm64/include/asm/linkage.h
> > > +++ b/arch/arm64/include/asm/linkage.h
> > > @@ -5,8 +5,14 @@
> > >  #include <asm/assembler.h>
> > >  #endif
> > >
> > > -#define __ALIGN		.align 2
> > > -#define __ALIGN_STR	".align 2"
> > > +#if CONFIG_FUNCTION_ALIGNMENT > 0
> > > +#define ARM64_FUNCTION_ALIGNMENT CONFIG_FUNCTION_ALIGNMENT
> > > +#else
> > > +#define ARM64_FUNCTION_ALIGNMENT 4
> > > +#endif
> > > +
> > > +#define __ALIGN		.balign ARM64_FUNCTION_ALIGNMENT
> > > +#define __ALIGN_STR	".balign " #ARM64_FUNCTION_ALIGNMENT
> >
> > Isn't that much the same as having ARM64 select FUNCTION_ALIGNMENT_4B
> > and simply removing all these lines and relying on the default
> > behaviour?
>
> There's a proposal (with some rough performance claims) to select
> FUNCTION_ALIGNMENT_16B over at:
>
>   https://lore.kernel.org/r/20221208053649.540891-1-almasrymina@google.com
>
> so we could just go with that?

I reckon it'd be worth having that as a separate patch atop, to split the
infrastructure from the actual change, but I'm happy to go with 16B
immediately if you'd prefer. It'd be nice if we could get some numbers...

Thanks,
Mark.
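For reference, the "default behaviour" being relied on here looks roughly
like the following sketch; the macro names follow the diff above, but the
exact generic definitions in the kernel may differ:

/*
 * Rough sketch of the generic linkage macros an architecture gets when it
 * selects FUNCTION_ALIGNMENT_4B (or a larger FUNCTION_ALIGNMENT_* option):
 * the alignment is derived from CONFIG_FUNCTION_ALIGNMENT, so arm64 would
 * not need its own ARM64_FUNCTION_ALIGNMENT fallback. Illustrative only.
 */
#define ___stringify(x...)	#x
#define __stringify(x...)	___stringify(x)

#define __ALIGN		.balign CONFIG_FUNCTION_ALIGNMENT
#define __ALIGN_STR	__stringify(__ALIGN)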
On Mon, Jan 09, 2023 at 03:43:16PM +0100, Miguel Ojeda wrote:
> On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > As far as I can tell, GCC doesn't respect '-falign-functions=N':
> >
> > * When the __weak__ attribute is used
> >
> >   GCC seems to forget the alignment specified by '-falign-functions=N',
> >   but will respect the '__aligned__(N)' function attribute. Thus, we can
> >   work around this by explicitly setting the alignment for weak
> >   functions.
> >
> > * When the __cold__ attribute is used
> >
> >   GCC seems to forget the alignment specified by '-falign-functions=N',
> >   and also doesn't seem to respect the '__aligned__(N)' function
> >   attribute. The only way to work around this is to not use the __cold__
> >   attribute.
>
> If you happen to have a reduced case, then it would be nice to link it
> in the commit. A bug report to GCC would also be nice.
>
> I gave it a very quick try in Compiler Explorer, but I couldn't
> reproduce it, so I guess it depends on flags, non-trivial functions or
> something else.

So having spent today coming up with tests, it turns out it's not quite as I
described above, but in a sense worse. I'm posting a summary here for
posterity; I'll try to get this to compiler folk shortly.

GCC appears to not align cold functions to the alignment specified by
`-falign-functions=N` when compiling at `-O1` or above. Alignment *can* be
restored with explicit attributes on each function, but due to some
interprocedural analysis, callees can be implicitly marked as cold (losing
their default alignment), which means we don't have a reliable mechanism to
ensure functions are always aligned short of annotating *every* function
explicitly (and I suspect that's not sufficient due to some interprocedural
optimizations). I've tested with the 12.1.0 binary release from the
kernel.org cross toolchains page.

LLVM always seems to respect `-falign-functions=N` at both `-O1` and `-O2`
(I tested the 10.0.0, 11.0.0, 11.0.1, and 15.0.6 binary releases from
llvm.org).
For example:

| [mark@lakrids:/mnt/data/tests/gcc-alignment]% cat test-cold.c
| #define __cold \
| 	__attribute__((cold))
|
| #define EXPORT_FUNC_PTR(func) \
| 	typeof((func)) *__ptr_##func = (func)
|
| __cold
| void cold_func_a(void) { }
|
| __cold
| void cold_func_b(void) { }
|
| __cold
| void cold_func_c(void) { }
|
| static __cold
| void static_cold_func_a(void) { }
| EXPORT_FUNC_PTR(static_cold_func_a);
|
| static __cold
| void static_cold_func_b(void) { }
| EXPORT_FUNC_PTR(static_cold_func_b);
|
| static __cold
| void static_cold_func_c(void) { }
| EXPORT_FUNC_PTR(static_cold_func_c);
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-gcc -falign-functions=16 -c test-cold.c -O1
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -d test-cold.o
|
| test-cold.o:     file format elf64-littleaarch64
|
|
| Disassembly of section .text:
|
| 0000000000000000 <static_cold_func_a>:
|    0:	d65f03c0 	ret
|
| 0000000000000004 <static_cold_func_b>:
|    4:	d65f03c0 	ret
|
| 0000000000000008 <static_cold_func_c>:
|    8:	d65f03c0 	ret
|
| 000000000000000c <cold_func_a>:
|    c:	d65f03c0 	ret
|
| 0000000000000010 <cold_func_b>:
|   10:	d65f03c0 	ret
|
| 0000000000000014 <cold_func_c>:
|   14:	d65f03c0 	ret
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -h test-cold.o
|
| test-cold.o:     file format elf64-littleaarch64
|
| Sections:
| Idx Name            Size      VMA               LMA               File off  Algn
|   0 .text           00000018  0000000000000000  0000000000000000  00000040  2**2
|                     CONTENTS, ALLOC, LOAD, READONLY, CODE
|   1 .data           00000018  0000000000000000  0000000000000000  00000058  2**3
|                     CONTENTS, ALLOC, LOAD, RELOC, DATA
|   2 .bss            00000000  0000000000000000  0000000000000000  00000070  2**0
|                     ALLOC
|   3 .comment        00000013  0000000000000000  0000000000000000  00000070  2**0
|                     CONTENTS, READONLY
|   4 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00000083  2**0
|                     CONTENTS, READONLY
|   5 .eh_frame       00000090  0000000000000000  0000000000000000  00000088  2**3
|                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

In simple cases, alignment *can* be restored if an explicit function
attribute is used.
For example:

| [mark@lakrids:/mnt/data/tests/gcc-alignment]% cat test-aligned-cold.c
| #define __aligned(n) \
| 	__attribute__((aligned(n)))
|
| #define __cold \
| 	__attribute__((cold)) __aligned(16)
|
| #define EXPORT_FUNC_PTR(func) \
| 	typeof((func)) *__ptr_##func = (func)
|
| __cold
| void cold_func_a(void) { }
|
| __cold
| void cold_func_b(void) { }
|
| __cold
| void cold_func_c(void) { }
|
| static __cold
| void static_cold_func_a(void) { }
| EXPORT_FUNC_PTR(static_cold_func_a);
|
| static __cold
| void static_cold_func_b(void) { }
| EXPORT_FUNC_PTR(static_cold_func_b);
|
| static __cold
| void static_cold_func_c(void) { }
| EXPORT_FUNC_PTR(static_cold_func_c);
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-gcc -falign-functions=16 -c test-aligned-cold.c -O1
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -d test-aligned-cold.o
|
| test-aligned-cold.o:     file format elf64-littleaarch64
|
|
| Disassembly of section .text:
|
| 0000000000000000 <static_cold_func_a>:
|    0:	d65f03c0 	ret
|    4:	d503201f 	nop
|    8:	d503201f 	nop
|    c:	d503201f 	nop
|
| 0000000000000010 <static_cold_func_b>:
|   10:	d65f03c0 	ret
|   14:	d503201f 	nop
|   18:	d503201f 	nop
|   1c:	d503201f 	nop
|
| 0000000000000020 <static_cold_func_c>:
|   20:	d65f03c0 	ret
|   24:	d503201f 	nop
|   28:	d503201f 	nop
|   2c:	d503201f 	nop
|
| 0000000000000030 <cold_func_a>:
|   30:	d65f03c0 	ret
|   34:	d503201f 	nop
|   38:	d503201f 	nop
|   3c:	d503201f 	nop
|
| 0000000000000040 <cold_func_b>:
|   40:	d65f03c0 	ret
|   44:	d503201f 	nop
|   48:	d503201f 	nop
|   4c:	d503201f 	nop
|
| 0000000000000050 <cold_func_c>:
|   50:	d65f03c0 	ret
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -h test-aligned-cold.o
|
| test-aligned-cold.o:     file format elf64-littleaarch64
|
| Sections:
| Idx Name            Size      VMA               LMA               File off  Algn
|   0 .text           00000054  0000000000000000  0000000000000000  00000040  2**4
|                     CONTENTS, ALLOC, LOAD, READONLY, CODE
|   1 .data           00000018  0000000000000000  0000000000000000  00000098  2**3
|                     CONTENTS, ALLOC, LOAD, RELOC, DATA
|   2 .bss            00000000  0000000000000000  0000000000000000  000000b0  2**0
|                     ALLOC
|   3 .comment        00000013  0000000000000000  0000000000000000  000000b0  2**0
|                     CONTENTS, READONLY
|   4 .note.GNU-stack 00000000  0000000000000000  0000000000000000  000000c3  2**0
|                     CONTENTS, READONLY
|   5 .eh_frame       00000090  0000000000000000  0000000000000000  000000c8  2**3
|                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

Unfortunately it appears that some interprocedural analysis determines that
if a callee is only called/referenced from cold callers, the callee is marked
as cold, and the alignment it would have got from the command line option is
dropped. If it's given an explicit alignment attribute, the alignment is
retained.
For example:

| [mark@lakrids:/mnt/data/tests/gcc-alignment]% cat test-aligned-cold-caller.c
| #define noinline \
| 	__attribute__((noinline))
|
| #define __aligned(n) \
| 	__attribute__((aligned(n)))
|
| #define __cold \
| 	__attribute__((cold)) __aligned(16)
|
| #define EXPORT_FUNC_PTR(func) \
| 	typeof((func)) *__ptr_##func = (func)
|
| static noinline void callee_a(void)
| {
| 	asm volatile("// callee_a\n" ::: "memory");
| }
|
| static noinline void callee_b(void)
| {
| 	asm volatile("// callee_b\n" ::: "memory");
| }
|
| static noinline void callee_c(void)
| {
| 	asm volatile("// callee_c\n" ::: "memory");
| }
| __cold
| void cold_func_a(void) { callee_a(); }
|
| __cold
| void cold_func_b(void) { callee_b(); }
|
| __cold
| void cold_func_c(void) { callee_c(); }
|
| static __cold
| void static_cold_func_a(void) { callee_a(); }
| EXPORT_FUNC_PTR(static_cold_func_a);
|
| static __cold
| void static_cold_func_b(void) { callee_b(); }
| EXPORT_FUNC_PTR(static_cold_func_b);
|
| static __cold
| void static_cold_func_c(void) { callee_c(); }
| EXPORT_FUNC_PTR(static_cold_func_c);
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-gcc -falign-functions=16 -c test-aligned-cold-caller.c -O1
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -d test-aligned-cold-caller.o
|
| test-aligned-cold-caller.o:     file format elf64-littleaarch64
|
|
| Disassembly of section .text:
|
| 0000000000000000 <callee_a>:
|    0:	d65f03c0 	ret
|
| 0000000000000004 <callee_b>:
|    4:	d65f03c0 	ret
|
| 0000000000000008 <callee_c>:
|    8:	d65f03c0 	ret
|    c:	d503201f 	nop
|
| 0000000000000010 <static_cold_func_a>:
|   10:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
|   14:	910003fd 	mov	x29, sp
|   18:	97fffffa 	bl	0 <callee_a>
|   1c:	a8c17bfd 	ldp	x29, x30, [sp], #16
|   20:	d65f03c0 	ret
|   24:	d503201f 	nop
|   28:	d503201f 	nop
|   2c:	d503201f 	nop
|
| 0000000000000030 <static_cold_func_b>:
|   30:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
|   34:	910003fd 	mov	x29, sp
|   38:	97fffff3 	bl	4 <callee_b>
|   3c:	a8c17bfd 	ldp	x29, x30, [sp], #16
|   40:	d65f03c0 	ret
|   44:	d503201f 	nop
|   48:	d503201f 	nop
|   4c:	d503201f 	nop
|
| 0000000000000050 <static_cold_func_c>:
|   50:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
|   54:	910003fd 	mov	x29, sp
|   58:	97ffffec 	bl	8 <callee_c>
|   5c:	a8c17bfd 	ldp	x29, x30, [sp], #16
|   60:	d65f03c0 	ret
|   64:	d503201f 	nop
|   68:	d503201f 	nop
|   6c:	d503201f 	nop
|
| 0000000000000070 <cold_func_a>:
|   70:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
|   74:	910003fd 	mov	x29, sp
|   78:	97ffffe2 	bl	0 <callee_a>
|   7c:	a8c17bfd 	ldp	x29, x30, [sp], #16
|   80:	d65f03c0 	ret
|   84:	d503201f 	nop
|   88:	d503201f 	nop
|   8c:	d503201f 	nop
|
| 0000000000000090 <cold_func_b>:
|   90:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
|   94:	910003fd 	mov	x29, sp
|   98:	97ffffdb 	bl	4 <callee_b>
|   9c:	a8c17bfd 	ldp	x29, x30, [sp], #16
|   a0:	d65f03c0 	ret
|   a4:	d503201f 	nop
|   a8:	d503201f 	nop
|   ac:	d503201f 	nop
|
| 00000000000000b0 <cold_func_c>:
|   b0:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
|   b4:	910003fd 	mov	x29, sp
|   b8:	97ffffd4 	bl	8 <callee_c>
|   bc:	a8c17bfd 	ldp	x29, x30, [sp], #16
|   c0:	d65f03c0 	ret
| [mark@lakrids:/mnt/data/tests/gcc-alignment]% usekorg 12.1.0 aarch64-linux-objdump -h test-aligned-cold-caller.o
|
| test-aligned-cold-caller.o:     file format elf64-littleaarch64
|
| Sections:
| Idx Name            Size      VMA               LMA               File off  Algn
|   0 .text           000000c4  0000000000000000  0000000000000000  00000040  2**4
|                     CONTENTS, ALLOC, LOAD, READONLY, CODE
|   1 .data           00000018  0000000000000000  0000000000000000  00000108  2**3
|                     CONTENTS, ALLOC, LOAD, RELOC, DATA
|   2 .bss            00000000  0000000000000000  0000000000000000  00000120  2**0
|                     ALLOC
|   3 .comment        00000013  0000000000000000  0000000000000000  00000120  2**0
|                     CONTENTS, READONLY
|   4 .note.GNU-stack 00000000  0000000000000000  0000000000000000  00000133  2**0
|                     CONTENTS, READONLY
|   5 .eh_frame       00000110  0000000000000000  0000000000000000  00000138  2**3
|                     CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA

Thanks,
Mark.
On Wed, Jan 11, 2023 at 06:27:53PM +0000, Mark Rutland wrote:
> On Mon, Jan 09, 2023 at 03:43:16PM +0100, Miguel Ojeda wrote:
> > On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > As far as I can tell, GCC doesn't respect '-falign-functions=N':
> > >
> > > * When the __weak__ attribute is used
> > >
> > >   GCC seems to forget the alignment specified by '-falign-functions=N',
> > >   but will respect the '__aligned__(N)' function attribute. Thus, we can
> > >   work around this by explicitly setting the alignment for weak
> > >   functions.
> > >
> > > * When the __cold__ attribute is used
> > >
> > >   GCC seems to forget the alignment specified by '-falign-functions=N',
> > >   and also doesn't seem to respect the '__aligned__(N)' function
> > >   attribute. The only way to work around this is to not use the __cold__
> > >   attribute.
> >
> > If you happen to have a reduced case, then it would be nice to link it
> > in the commit. A bug report to GCC would also be nice.
> >
> > I gave it a very quick try in Compiler Explorer, but I couldn't
> > reproduce it, so I guess it depends on flags, non-trivial functions or
> > something else.
>
> So having spent today coming up with tests, it turns out it's not quite as I
> described above, but in a sense worse. I'm posting a summary here for
> posterity; I'll try to get this to compiler folk shortly.

I've added the cold bits to an existing ticket:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345

I have not been able to reproduce the issue with __weak__, so I'll go dig into
that some more; it's likely I was mistaken there.

Thanks,
Mark.
On Thu, Jan 12, 2023 at 11:38:17AM +0000, Mark Rutland wrote:
> On Wed, Jan 11, 2023 at 06:27:53PM +0000, Mark Rutland wrote:
> > On Mon, Jan 09, 2023 at 03:43:16PM +0100, Miguel Ojeda wrote:
> > > On Mon, Jan 9, 2023 at 2:58 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > As far as I can tell, GCC doesn't respect '-falign-functions=N':
> > > >
> > > > * When the __weak__ attribute is used
> > > >
> > > >   GCC seems to forget the alignment specified by '-falign-functions=N',
> > > >   but will respect the '__aligned__(N)' function attribute. Thus, we can
> > > >   work around this by explicitly setting the alignment for weak
> > > >   functions.
> > > >
> > > > * When the __cold__ attribute is used
> > > >
> > > >   GCC seems to forget the alignment specified by '-falign-functions=N',
> > > >   and also doesn't seem to respect the '__aligned__(N)' function
> > > >   attribute. The only way to work around this is to not use the __cold__
> > > >   attribute.
> > >
> > > If you happen to have a reduced case, then it would be nice to link it
> > > in the commit. A bug report to GCC would also be nice.
> > >
> > > I gave it a very quick try in Compiler Explorer, but I couldn't
> > > reproduce it, so I guess it depends on flags, non-trivial functions or
> > > something else.
> >
> > So having spent today coming up with tests, it turns out it's not quite as I
> > described above, but in a sense worse. I'm posting a summary here for
> > posterity; I'll try to get this to compiler folk shortly.
>
> I've added the cold bits to an existing ticket:
>
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345
>
> I have not been able to reproduce the issue with __weak__, so I'll go dig into
> that some more; it's likely I was mistaken there.

It turns out that was a red herring; GCC is actually implicitly marking the
abort() function as cold, and as Linux's implementation happened to be marked
as weak I assumed that was the culprit.

I'll drop the changes to weak and update our abort implementation
specifically, with a comment. I'll also go update the ticket above.

Thanks,
Mark.
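A hypothetical sketch of the kind of change being described (not the actual
patch; the alignment value, attributes, and body are assumptions made for
illustration):

/*
 * Illustrative only: GCC implicitly treats abort() as cold, which drops the
 * -falign-functions alignment, so the kernel's implementation would need an
 * explicit alignment attribute to stay aligned. Compile freestanding (e.g.
 * with -ffreestanding) to avoid clashing with the hosted abort() prototype.
 */
__attribute__((__noreturn__))
__attribute__((__aligned__(8)))
void abort(void)
{
	/* Stand-in for the kernel's real abort() behaviour. */
	for (;;)
		;
}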
On Fri, Jan 13, 2023 at 1:49 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> It turns out that was a red herring; GCC is actually implicitly marking the
> abort() function as cold, and as Linux's implementation happened to be marked
> as weak I assumed that was the culprit.

That and your previous message probably explain why I couldn't reproduce it.

Thanks a lot for all the details -- the `cold` issue is reproducible since
gcc 4.6 at least:

  https://godbolt.org/z/PoxazzT9T

The `abort` case appears to happen since gcc 8.1.

Cheers,
Miguel