[RFC,bpf-next,00/16] bpf: Speed up trampoline attach

Message ID 20201022082138.2322434-1-jolsa@kernel.org

Message

Jiri Olsa Oct. 22, 2020, 8:21 a.m. UTC
hi,
this patchset tries to speed up the attach time for trampolines
and make bpftrace faster for wildcard use cases like:

  # bpftrace -ve "kfunc:__x64_sys_s* { printf("test\n"); }"

Profiles show mostly the ftrace backend, because we add trampoline
functions one by one and ftrace direct function registering is
quite expensive. Thus the main change in this patchset is to allow
batch attach and use just a single ftrace call to attach or detach
multiple ips/trampolines.
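
To illustrate the intent, here is a minimal sketch of a batch attach built
on the new helpers added in this series (ftrace_set_filter_ips,
register_ftrace_direct_ips); the exact prototypes and the direct_ops and
attach_batch names below are assumptions for the example, not the final API:

/*
 * Sketch only: one ftrace hash update and one direct-call registration
 * (and thus one text patching pass) for the whole set of ips, instead of
 * one register_ftrace_direct() call per traced function.
 */
static int attach_batch(struct ftrace_ops *direct_ops, unsigned long *ips,
			unsigned int cnt, unsigned long tramp_addr)
{
	int err;

	/* add all ips to the ops filter hash in one go */
	err = ftrace_set_filter_ips(direct_ops, ips, cnt, 0, 0);
	if (err)
		return err;

	/* register the trampoline for every ip with a single ftrace call */
	return register_ftrace_direct_ips(ips, cnt, tramp_addr);
}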

This patchset also contains other speedup changes that showed
up in profiles:

  - delayed link free
    to bypass detach cycles completely

  - kallsyms rbtree search
    change linear search to rb tree search
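
For the delayed link free, a rough sketch of the idea (an assumption about
the mechanism based on the patch title, not the actual patch) is to always
free the link from a worker instead of doing the detach work inline in the
last bpf_link_put(), so a process exiting with many links attached does not
pay the detach cost synchronously:

static void bpf_link_free_deferred(struct work_struct *work)
{
	struct bpf_link *link = container_of(work, struct bpf_link, work);

	bpf_link_free(link);
}

void bpf_link_put(struct bpf_link *link)
{
	if (!atomic64_dec_and_test(&link->refcnt))
		return;

	/* always defer the free instead of detaching inline */
	INIT_WORK(&link->work, bpf_link_free_deferred);
	schedule_work(&link->work);
}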
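
And for the kallsyms change, a quick sketch of an rb-tree name lookup
(illustration only; struct ksym_node and ksym_find are hypothetical names
and the actual patch may key and build the tree differently):

#include <linux/rbtree.h>
#include <linux/string.h>

struct ksym_node {
	struct rb_node	node;
	const char	*name;
	unsigned long	addr;
};

static struct rb_root ksym_root = RB_ROOT;

static struct ksym_node *ksym_find(const char *name)
{
	struct rb_node *n = ksym_root.rb_node;

	while (n) {
		struct ksym_node *k = rb_entry(n, struct ksym_node, node);
		int cmp = strcmp(name, k->name);

		if (cmp < 0)
			n = n->rb_left;
		else if (cmp > 0)
			n = n->rb_right;
		else
			return k;	/* O(log n) instead of a linear scan */
	}
	return NULL;
}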

For a clean attach workload I also added a new attach selftest,
which is not meant to be merged but is used to show profile
results.

The following numbers show the speedup after applying each change
on top of the previous ones (the changes are cumulative).

profiled with: 'perf stat -r 5 -e cycles:k,cycles:u ...'

For bpftrace:

  # bpftrace -ve "kfunc:__x64_sys_s* { printf("test\n"); } i:ms:10 { printf("exit\n"); exit();}"

  - base

      3,290,457,628      cycles:k         ( +-  0.27% )
        933,581,973      cycles:u         ( +-  0.20% )

      50.25 +- 4.79 seconds time elapsed  ( +-  9.53% )

  + delayed link free

      2,535,458,767      cycles:k         ( +-  0.55% )
        940,046,382      cycles:u         ( +-  0.27% )

      33.60 +- 3.27 seconds time elapsed  ( +-  9.73% )

  + kallsym rbtree search

      2,199,433,771      cycles:k         ( +-  0.55% )
        936,105,469      cycles:u         ( +-  0.37% )

      26.48 +- 3.57 seconds time elapsed  ( +- 13.49% )

  + batch support

      1,456,854,867      cycles:k         ( +-  0.57% )
        937,737,431      cycles:u         ( +-  0.13% )

      12.44 +- 2.98 seconds time elapsed  ( +- 23.95% )

  + rcu fix

      1,427,959,119      cycles:k         ( +-  0.87% )
        930,833,507      cycles:u         ( +-  0.23% )

      14.53 +- 3.51 seconds time elapsed  ( +- 24.14% )


For attach_test the numbers do not show a direct time speedup when
using the batch support, but they do show a big decrease in kernel cycles.
It seems the time is spent waiting on rcu, which I tried to
address in a most likely wrong rcu fix:

  # ./test_progs -t attach_test

  - base

      1,350,136,760      cycles:k         ( +-  0.07% )
         70,591,712      cycles:u         ( +-  0.26% )

      24.26 +- 2.82 seconds time elapsed  ( +- 11.62% )

  + delayed link free

        996,152,309      cycles:k         ( +-  0.37% )
         69,263,150      cycles:u         ( +-  0.50% )

      15.63 +- 1.80 seconds time elapsed  ( +- 11.51% )

  + kallsym rbtree search

        390,217,706      cycles:k         ( +-  0.66% )
         68,999,019      cycles:u         ( +-  0.46% )

      14.11 +- 2.11 seconds time elapsed  ( +- 14.98% )

  + batch support

         37,410,887      cycles:k         ( +-  0.98% )
         70,062,158      cycles:u         ( +-  0.39% )

      26.80 +- 4.10 seconds time elapsed  ( +- 15.31% )

  + rcu fix

         36,812,432      cycles:k         ( +-  2.52% )
         69,907,191      cycles:u         ( +-  0.38% )

      15.04 +- 2.94 seconds time elapsed  ( +- 19.54% )


I still need to go through the changes and double check them;
also, those ftrace changes are most likely wrong and I most likely
broke a few tests (hence it's RFC), but I wonder whether you guys would
like this batch solution and if there are any thoughts on that.

Also available in
  git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  bpf/batch

thanks,
jirka


---
Jiri Olsa (16):
      ftrace: Add check_direct_entry function
      ftrace: Add adjust_direct_size function
      ftrace: Add get/put_direct_func function
      ftrace: Add ftrace_set_filter_ips function
      ftrace: Add register_ftrace_direct_ips function
      ftrace: Add unregister_ftrace_direct_ips function
      kallsyms: Use rb tree for kallsyms name search
      bpf: Use delayed link free in bpf_link_put
      bpf: Add BPF_TRAMPOLINE_BATCH_ATTACH support
      bpf: Add BPF_TRAMPOLINE_BATCH_DETACH support
      bpf: Sync uapi bpf.h to tools
      bpf: Move synchronize_rcu_mult for batch processing (NOT TO BE MERGED)
      libbpf: Add trampoline batch attach support
      libbpf: Add trampoline batch detach support
      selftests/bpf: Add trampoline batch test
      selftests/bpf: Add attach batch test (NOT TO BE MERGED)

 include/linux/bpf.h                                       |  18 +++++-
 include/linux/ftrace.h                                    |   7 +++
 include/uapi/linux/bpf.h                                  |   8 +++
 kernel/bpf/syscall.c                                      | 125 ++++++++++++++++++++++++++++++++++----
 kernel/bpf/trampoline.c                                   |  95 +++++++++++++++++++++++------
 kernel/kallsyms.c                                         |  95 ++++++++++++++++++++++++++---
 kernel/trace/ftrace.c                                     | 304 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--------------
 net/bpf/test_run.c                                        |  55 +++++++++++++++++
 tools/include/uapi/linux/bpf.h                            |   8 +++
 tools/lib/bpf/bpf.c                                       |  24 ++++++++
 tools/lib/bpf/bpf.h                                       |   2 +
 tools/lib/bpf/libbpf.c                                    | 126 ++++++++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.h                                    |   5 +-
 tools/lib/bpf/libbpf.map                                  |   2 +
 tools/testing/selftests/bpf/prog_tests/attach_test.c      |  27 +++++++++
 tools/testing/selftests/bpf/prog_tests/trampoline_batch.c |  45 ++++++++++++++
 tools/testing/selftests/bpf/progs/attach_test.c           |  62 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/trampoline_batch_test.c |  75 +++++++++++++++++++++++
 18 files changed, 995 insertions(+), 88 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/attach_test.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/trampoline_batch.c
 create mode 100644 tools/testing/selftests/bpf/progs/attach_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/trampoline_batch_test.c

Comments

Steven Rostedt Oct. 22, 2020, 1:35 p.m. UTC | #1
On Thu, 22 Oct 2020 10:21:22 +0200
Jiri Olsa <jolsa@kernel.org> wrote:

> hi,
> this patchset tries to speed up the attach time for trampolines
> and make bpftrace faster for wildcard use cases like:
> 
>   # bpftrace -ve "kfunc:__x64_sys_s* { printf("test\n"); }"
> 
> Profiles show mostly the ftrace backend, because we add trampoline
> functions one by one and ftrace direct function registering is
> quite expensive. Thus the main change in this patchset is to allow
> batch attach and use just a single ftrace call to attach or detach
> multiple ips/trampolines.

The issue I have with this change is that the purpose of the direct
trampoline was to give bpf access to the parameters of a function as if it
was called directly. That is, it could see the parameters of a function
quickly. I even made the direct function work if it wanted to also trace
the return code.

What the direct calls are NOT is a generic function tracer. If that
is required, then bpftrace should be registering itself with ftrace.
If you are attaching to a set of functions where it becomes obvious that
it's not being used to access specific parameters, then that's an abuse of
the direct calls.

We already have one generic function tracer, we don't need another.

-- Steve
Jiri Olsa Oct. 22, 2020, 2:11 p.m. UTC | #2
On Thu, Oct 22, 2020 at 09:35:10AM -0400, Steven Rostedt wrote:
> On Thu, 22 Oct 2020 10:21:22 +0200
> Jiri Olsa <jolsa@kernel.org> wrote:
> 
> > hi,
> > this patchset tries to speed up the attach time for trampolines
> > and make bpftrace faster for wildcard use cases like:
> > 
> >   # bpftrace -ve "kfunc:__x64_sys_s* { printf("test\n"); }"
> > 
> > Profiles show mostly the ftrace backend, because we add trampoline
> > functions one by one and ftrace direct function registering is
> > quite expensive. Thus the main change in this patchset is to allow
> > batch attach and use just a single ftrace call to attach or detach
> > multiple ips/trampolines.
> 
> The issue I have with this change is that the purpose of the direct
> trampoline was to give bpf access to the parameters of a function as if it
> was called directly. That is, it could see the parameters of a function
> quickly. I even made the direct function work if it wanted to also trace
> the return code.
> 
> What the direct calls are NOT is a generic function tracer. If that
> is required, then bpftrace should be registering itself with ftrace.
> If you are attaching to a set of functions where it becomes obvious that
> it's not being used to access specific parameters, then that's an abuse of
> the direct calls.
> 
> We already have one generic function tracer, we don't need another.

I understand direct calls as a way that bpf trampolines and ftrace can
co-exist together - ebpf trampolines need that functionality of accessing
parameters of a function as if it was called directly and at the same
point we need to be able attach to any function and to as many functions
as we want in a fast way

the bpftrace example above did not use arguments for simplicity, but they
could have been there ... I think we could detect the presence of arguments
in ebpf programs and use ftrace_ops directly in case they are not needed

jirka
Steven Rostedt Oct. 22, 2020, 2:42 p.m. UTC | #3
On Thu, 22 Oct 2020 16:11:54 +0200
Jiri Olsa <jolsa@redhat.com> wrote:

> I understand direct calls as a way that bpf trampolines and ftrace can
> co-exist together - ebpf trampolines need that functionality of accessing
> parameters of a function as if it was called directly and at the same
> point we need to be able attach to any function and to as many functions
> as we want in a fast way

I was sold that bpf needed a quick and fast way to get the arguments of a
function, as the only way to do that with ftrace is to save all registers,
which, I was told was too much overhead, as if you only care about
arguments, there's much less that is needed to save.

Direct calls wasn't added so that bpf and ftrace could co-exist, it was
that for certain cases, bpf wanted a faster way to access arguments,
because it still worked with ftrace, but the saving of regs was too
strenuous.

> 
> the bpftrace example above did not use arguments for simplicity, but they
> could have been there ... I think we could detect the presence of arguments
> in ebpf programs and use ftrace_ops directly in case they are not needed

What I don't see is why one would need to access arguments directly for a
lot of calls. The direct trampoline was for "one-offs", because for every
function that has a direct trampoline, it prevents kretprobes and function
graph tracer from accessing it. Before I allow a "batch" direct caller, I
need it to not break function graph tracing.

If we are going to have a way to get parameters for multiple functions, I
would then like to have that be directly part of the ftrace infrastructure.
That is, allow more than just bpf to have access to this. If it's going to
be generic, then let's have it work for all function trace users and not
just bpf.

I'd like to see how batch functions will work. I guess I need to start
looking at the bpf trampoline, to see if we can modify the ftrace
trampoline to have a quick access to parameters. It would be much more
beneficial to update the existing generic function tracer to have access to
function parameters that all users could benefit from, than to tweak a
single use case into giving this feature to a single user.

-- Steve
Steven Rostedt Oct. 22, 2020, 4:21 p.m. UTC | #4
On Thu, 22 Oct 2020 10:42:05 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I'd like to see how batch functions will work. I guess I need to start
> looking at the bpf trampoline, to see if we can modify the ftrace
> trampoline to have a quick access to parameters. It would be much more
> beneficial to update the existing generic function tracer to have access to
> function parameters that all users could benefit from, than to tweak a
> single use case into giving this feature to a single user.

Looking at the creation of the bpf trampoline, I think I can modify ftrace
to have a more flexible callback. Something that passes the callback the
following:

 the function being traced.
 a pointer to the parent caller (that could be modified)
 a pointer to the original stack frame (what the stack was when the
      function is entered)
 An array of the arguments of the function (which could also be modified)
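
In C, such a callback could look roughly like this (the proof-of-concept
patch in the next message uses this exact shape):

typedef void (*ftrace_args_func_t)(unsigned long ip, unsigned long *parent_ip,
				   struct ftrace_ops *op, unsigned long *stack,
				   unsigned long *args);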

This is a change I've been wanting to make for some time, because it would
allow function graph to be a user of function tracer, and would give
everything access to the arguments.

We would still need a helper function to store all regs to keep kprobes
working unmodified, but this would still only be done if asked.

The above change shouldn't hurt performance for either ftrace or bpf
because it appears they both do the same. If BPF wants to have a batch
processing of functions, then I think we should modify ftrace to do this
new approach instead of creating another set of function trampolines.

-- Steve
Steven Rostedt Oct. 22, 2020, 8:52 p.m. UTC | #5
On Thu, 22 Oct 2020 12:21:50 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 22 Oct 2020 10:42:05 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > I'd like to see how batch functions will work. I guess I need to start
> > looking at the bpf trampoline, to see if we can modify the ftrace
> > trampoline to have a quick access to parameters. It would be much more
> > beneficial to update the existing generic function tracer to have access to
> > function parameters that all users could benefit from, than to tweak a
> > single use case into giving this feature to a single user.
> 
> Looking at the creation of the bpf trampoline, I think I can modify ftrace
> to have a more flexible callback. Something that passes the callback the
> following:
> 
>  the function being traced.
>  a pointer to the parent caller (that could be modified)
>  a pointer to the original stack frame (what the stack was when the
>       function is entered)
>  An array of the arguments of the function (which could also be modified)
> 
> This is a change I've been wanting to make for some time, because it would
> allow function graph to be a user of function tracer, and would give
> everything access to the arguments.
> 
> We would still need a helper function to store all regs to keep kprobes
> working unmodified, but this would still only be done if asked.
> 
> The above change shouldn't hurt performance for either ftrace or bpf
> because it appears they both do the same. If BPF wants to have a batch
> processing of functions, then I think we should modify ftrace to do this
> new approach instead of creating another set of function trampolines.

The below is a quick proof of concept patch I whipped up. It will always
save 6 arguments, so if BPF is really interested in just saving the bare
minimum of arguments before calling, it can still use direct. But if you
are going to have a generic callback, you'll need to save all parameters,
otherwise you can corrupt the function's parameters if the traced function
uses more than you save.

Also, looking at the bpf trampoline code, I noticed that if the verifier
can't find the btf func, it falls back to saving 5 parameters, which can be
a bug on x86 if the function itself uses 6 or more. If you only save 5
parameters and then call a bpf program that calls a helper function that uses
more than 5 parameters, it will likely corrupt the 6th parameter of the
function being traced.

The code in question is this:

int btf_distill_func_proto(struct bpf_verifier_log *log,
			   struct btf *btf,
			   const struct btf_type *func,
			   const char *tname,
			   struct btf_func_model *m)
{
	const struct btf_param *args;
	const struct btf_type *t;
	u32 i, nargs;
	int ret;

	if (!func) {
		/* BTF function prototype doesn't match the verifier types.
		 * Fall back to 5 u64 args.
		 */
		for (i = 0; i < 5; i++)
			m->arg_size[i] = 8;
		m->ret_size = 8;
		m->nr_args = 5;
		return 0;
	}

Shouldn't it be falling back to 6, not 5?

-- Steve

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 7edbd5ee5ed4..b65d73f430ed 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -287,6 +287,10 @@ extern void ftrace_caller_end(void);
 extern void ftrace_caller_op_ptr(void);
 extern void ftrace_regs_caller_op_ptr(void);
 extern void ftrace_regs_caller_jmp(void);
+extern void ftrace_args_caller(void);
+extern void ftrace_args_call(void);
+extern void ftrace_args_caller_end(void);
+extern void ftrace_args_caller_op_ptr(void);
 
 /* movq function_trace_op(%rip), %rdx */
 /* 0x48 0x8b 0x15 <offset-to-ftrace_trace_op (4 bytes)> */
@@ -317,7 +321,7 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 	unsigned long end_offset;
 	unsigned long op_offset;
 	unsigned long call_offset;
-	unsigned long jmp_offset;
+	unsigned long jmp_offset = 0;
 	unsigned long offset;
 	unsigned long npages;
 	unsigned long size;
@@ -336,12 +340,16 @@ create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
 		op_offset = (unsigned long)ftrace_regs_caller_op_ptr;
 		call_offset = (unsigned long)ftrace_regs_call;
 		jmp_offset = (unsigned long)ftrace_regs_caller_jmp;
+	} else if (ops->flags & FTRACE_OPS_FL_ARGS) {
+		start_offset = (unsigned long)ftrace_args_caller;
+		end_offset = (unsigned long)ftrace_args_caller_end;
+		op_offset = (unsigned long)ftrace_args_caller_op_ptr;
+		call_offset = (unsigned long)ftrace_args_call;
 	} else {
 		start_offset = (unsigned long)ftrace_caller;
 		end_offset = (unsigned long)ftrace_caller_end;
 		op_offset = (unsigned long)ftrace_caller_op_ptr;
 		call_offset = (unsigned long)ftrace_call;
-		jmp_offset = 0;
 	}
 
 	size = end_offset - start_offset;
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index ac3d5f22fe64..65ca634d0b37 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -176,6 +176,58 @@ SYM_INNER_LABEL_ALIGN(ftrace_stub, SYM_L_WEAK)
 	retq
 SYM_FUNC_END(ftrace_epilogue)
 
+SYM_FUNC_START(ftrace_args_caller)
+#ifdef CONFIG_FRAME_POINTER
+	push %rbp
+	movq %rsp, %rbp
+# define CALLED_OFFSET (7 * 8)
+# define PARENT_OFFSET (8 * 8)
+#else
+# define CALLED_OFFSET (6 * 8)
+# define PARENT_OFFSET (7 * 8)
+#endif
+	/* save args */
+	pushq %r9
+	pushq %r8
+	pushq %rcx
+	pushq %rdx
+	pushq %rsi
+	pushq %rdi
+
+	/*
+	 * Parameters:
+	 *   Called site (function called)
+	 *   Address of parent location
+	 *   pointer to ftrace_ops
+	 *   Location of stack when function was called
+	 *   Array of arguments.
+	 */
+	movq CALLED_OFFSET(%rsp), %rdi
+	leaq PARENT_OFFSET(%rsp), %rsi
+SYM_INNER_LABEL(ftrace_args_caller_op_ptr, SYM_L_GLOBAL)
+	/* Load the ftrace_ops into the 3rd parameter */
+	movq function_trace_op(%rip), %rdx
+	movq %rsi, %rcx
+	leaq 0(%rsp), %r8
+
+SYM_INNER_LABEL(ftrace_args_call, SYM_L_GLOBAL)
+	callq ftrace_stub
+
+	popq %rdi
+	popq %rsi
+	popq %rdx
+	popq %rcx
+	popq %r8
+	popq %r9
+
+#ifdef CONFIG_FRAME_POINTER
+	popq %rbp
+#endif
+
+SYM_INNER_LABEL(ftrace_args_caller_end, SYM_L_GLOBAL)
+	jmp ftrace_epilogue
+SYM_FUNC_END(ftrace_args_caller)
+
 SYM_FUNC_START(ftrace_regs_caller)
 	/* Save the current flags before any operations that can change them */
 	pushfq
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 1bd3a0356ae4..0d077e8d7bb4 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -92,6 +92,17 @@ struct ftrace_ops;
 typedef void (*ftrace_func_t)(unsigned long ip, unsigned long parent_ip,
 			      struct ftrace_ops *op, struct pt_regs *regs);
 
+typedef void (*ftrace_args_func_t)(unsigned long ip, unsigned long *parent_ip,
+				   struct ftrace_ops *op, unsigned long *stack,
+				   unsigned long *args);
+
+union ftrace_callback {
+	ftrace_func_t		func;
+	ftrace_args_func_t	args_func;
+};
+
+typedef union ftrace_callback ftrace_callback_t;
+
 ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops);
 
 /*
@@ -169,6 +180,7 @@ enum {
 	FTRACE_OPS_FL_TRACE_ARRAY		= BIT(15),
 	FTRACE_OPS_FL_PERMANENT                 = BIT(16),
 	FTRACE_OPS_FL_DIRECT			= BIT(17),
+	FTRACE_OPS_FL_ARGS			= BIT(18),
 };
 
 #ifdef CONFIG_DYNAMIC_FTRACE
@@ -447,9 +459,11 @@ enum {
 	FTRACE_FL_DISABLED	= (1UL << 25),
 	FTRACE_FL_DIRECT	= (1UL << 24),
 	FTRACE_FL_DIRECT_EN	= (1UL << 23),
+	FTRACE_FL_ARGS		= (1UL << 22),
+	FTRACE_FL_ARGS_EN	= (1UL << 21),
 };
 
-#define FTRACE_REF_MAX_SHIFT	23
+#define FTRACE_REF_MAX_SHIFT	21
 #define FTRACE_REF_MAX		((1UL << FTRACE_REF_MAX_SHIFT) - 1)
 
 #define ftrace_rec_count(rec)	((rec)->flags & FTRACE_REF_MAX)
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 4833b6a82ce7..5632b0809dc0 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -1721,6 +1721,9 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
 			if (ops->flags & FTRACE_OPS_FL_DIRECT)
 				rec->flags |= FTRACE_FL_DIRECT;
 
+			else if (ops->flags & FTRACE_OPS_FL_ARGS)
+				rec->flags |= FTRACE_FL_ARGS;
+
 			/*
 			 * If there's only a single callback registered to a
 			 * function, and the ops has a trampoline registered
@@ -1757,6 +1760,10 @@ static bool __ftrace_hash_rec_update(struct ftrace_ops *ops,
 			if (ops->flags & FTRACE_OPS_FL_DIRECT)
 				rec->flags &= ~FTRACE_FL_DIRECT;
 
+			/* POC: but we will have more than one */
+			if (ops->flags & FTRACE_OPS_FL_ARGS)
+				rec->flags &= ~FTRACE_FL_ARGS;
+
 			/*
 			 * If the rec had REGS enabled and the ops that is
 			 * being removed had REGS set, then see if there is
@@ -2103,6 +2110,13 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 		    !(rec->flags & FTRACE_FL_TRAMP_EN))
 			flag |= FTRACE_FL_TRAMP;
 
+		/* Proof of concept */
+		if (ftrace_rec_count(rec) == 1) {
+			if (!(rec->flags & FTRACE_FL_ARGS) !=
+			    !(rec->flags & FTRACE_FL_ARGS_EN))
+				flag |= FTRACE_FL_ARGS;
+		}
+
 		/*
 		 * Direct calls are special, as count matters.
 		 * We must test the record for direct, if the
@@ -2144,6 +2158,17 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 				else
 					rec->flags &= ~FTRACE_FL_TRAMP_EN;
 			}
+			if (flag & FTRACE_FL_ARGS) {
+				if (ftrace_rec_count(rec) == 1) {
+					if (rec->flags & FTRACE_FL_ARGS)
+						rec->flags |= FTRACE_FL_ARGS_EN;
+					else
+						rec->flags &= ~FTRACE_FL_ARGS_EN;
+				} else {
+					rec->flags &= ~FTRACE_FL_ARGS_EN;
+				}
+			}
+
 			if (flag & FTRACE_FL_DIRECT) {
 				/*
 				 * If there's only one user (direct_ops helper)
@@ -2192,7 +2217,8 @@ static int ftrace_check_record(struct dyn_ftrace *rec, bool enable, bool update)
 			 * and REGS states. The _EN flags must be disabled though.
 			 */
 			rec->flags &= ~(FTRACE_FL_ENABLED | FTRACE_FL_TRAMP_EN |
-					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN);
+					FTRACE_FL_REGS_EN | FTRACE_FL_DIRECT_EN |
+					FTRACE_FL_ARGS_EN);
 	}
 
 	ftrace_bug_type = FTRACE_BUG_NOP;
@@ -3630,7 +3656,8 @@ static int t_show(struct seq_file *m, void *v)
 			   ftrace_rec_count(rec),
 			   rec->flags & FTRACE_FL_REGS ? " R" : "  ",
 			   rec->flags & FTRACE_FL_IPMODIFY ? " I" : "  ",
-			   rec->flags & FTRACE_FL_DIRECT ? " D" : "  ");
+			   rec->flags & FTRACE_FL_DIRECT ? " D" :
+			   rec->flags & FTRACE_FL_ARGS ? " A" : "  ");
 		if (rec->flags & FTRACE_FL_TRAMP_EN) {
 			ops = ftrace_find_tramp_ops_any(rec);
 			if (ops) {
diff --git a/kernel/trace/trace_functions.c b/kernel/trace/trace_functions.c
index 2c2126e1871d..a3da84b0e599 100644
--- a/kernel/trace/trace_functions.c
+++ b/kernel/trace/trace_functions.c
@@ -86,9 +86,48 @@ void ftrace_destroy_function_files(struct trace_array *tr)
 	ftrace_free_ftrace_ops(tr);
 }
 
+static void function_args_trace_call(unsigned long ip,
+				     unsigned long *parent_ip,
+				     struct ftrace_ops *op,
+				     unsigned long *stack,
+				     unsigned long *args)
+{
+	struct trace_array *tr = op->private;
+	struct trace_array_cpu *data;
+	unsigned long flags;
+	int bit;
+	int cpu;
+	int pc;
+
+	if (unlikely(!tr->function_enabled))
+		return;
+
+	pc = preempt_count();
+	preempt_disable_notrace();
+
+	bit = trace_test_and_set_recursion(TRACE_FTRACE_START, TRACE_FTRACE_MAX);
+	if (bit < 0)
+		goto out;
+
+	cpu = smp_processor_id();
+	data = per_cpu_ptr(tr->array_buffer.data, cpu);
+	if (!atomic_read(&data->disabled)) {
+		local_save_flags(flags);
+		trace_function(tr, ip, *parent_ip, flags, pc);
+		trace_printk("%pS %lx %lx %lx %lx %lx %lx\n",
+			     (void *)ip, args[0], args[1], args[2], args[3],
+			     args[4], args[5]);
+	}
+	trace_clear_recursion(bit);
+
+ out:
+	preempt_enable_notrace();
+
+}
+
 static int function_trace_init(struct trace_array *tr)
 {
-	ftrace_func_t func;
+	ftrace_callback_t callback;
 
 	/*
 	 * Instance trace_arrays get their ops allocated
@@ -101,11 +140,14 @@ static int function_trace_init(struct trace_array *tr)
 	/* Currently only the global instance can do stack tracing */
 	if (tr->flags & TRACE_ARRAY_FL_GLOBAL &&
 	    func_flags.val & TRACE_FUNC_OPT_STACK)
-		func = function_stack_trace_call;
-	else
-		func = function_trace_call;
+		callback.func = function_stack_trace_call;
+	else {
+		tr->ops->flags |= FTRACE_OPS_FL_ARGS;
+		callback.args_func = function_args_trace_call;
+	}
+//		func = function_trace_call;
 
-	ftrace_init_array_ops(tr, func);
+	ftrace_init_array_ops(tr, callback.func);
 
 	tr->array_buffer.cpu = get_cpu();
 	put_cpu();
Jiri Olsa Oct. 23, 2020, 6:09 a.m. UTC | #6
On Thu, Oct 22, 2020 at 04:52:29PM -0400, Steven Rostedt wrote:
> On Thu, 22 Oct 2020 12:21:50 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Thu, 22 Oct 2020 10:42:05 -0400
> > Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > I'd like to see how batch functions will work. I guess I need to start
> > > looking at the bpf trampoline, to see if we can modify the ftrace
> > > trampoline to have a quick access to parameters. It would be much more
> > > beneficial to update the existing generic function tracer to have access to
> > > function parameters that all users could benefit from, than to tweak a
> > > single use case into giving this feature to a single user.
> > 
> > Looking at the creation of the bpf trampoline, I think I can modify ftrace
> > to have a more flexible callback. Something that passes the callback the
> > following:
> > 
> >  the function being traced.
> >  a pointer to the parent caller (that could be modified)
> >  a pointer to the original stack frame (what the stack was when the
> >       function is entered)
> >  An array of the arguments of the function (which could also be modified)
> > 
> > This is a change I've been wanting to make for some time, because it would
> > allow function graph to be a user of function tracer, and would give
> > everything access to the arguments.
> > 
> > We would still need a helper function to store all regs to keep kprobes
> > working unmodified, but this would still only be done if asked.
> > 
> > The above change shouldn't hurt performance for either ftrace or bpf
> > because it appears they both do the same. If BPF wants to have a batch
> > processing of functions, then I think we should modify ftrace to do this
> > new approach instead of creating another set of function trampolines.
> 
> The below is a quick proof of concept patch I whipped up. It will always
> save 6 arguments, so if BPF is really interested in just saving the bare
> minimum of arguments before calling, it can still use direct. But if you
> are going to have a generic callback, you'll need to save all parameters,
> otherwise you can corrupt the function's parameters if the traced function
> uses more than you save.

nice, I'll take a look, thanks for quick code ;-)

> 
> Also, looking at the bpf trampoline code, I noticed that if the verifier
> can't find the btf func, it falls back to saving 5 parameters, which can be
> a bug on x86 if the function itself uses 6 or more. If you only save 5
> parameters and then call a bpf program that calls a helper function that uses
> more than 5 parameters, it will likely corrupt the 6th parameter of the
> function being traced.

number of args from eBPF program to in-kernel function is
restricted to 5, so we should be fine

> 
> The code in question is this:
> 
> int btf_distill_func_proto(struct bpf_verifier_log *log,
> 			   struct btf *btf,
> 			   const struct btf_type *func,
> 			   const char *tname,
> 			   struct btf_func_model *m)
> {
> 	const struct btf_param *args;
> 	const struct btf_type *t;
> 	u32 i, nargs;
> 	int ret;
> 
> 	if (!func) {
> 		/* BTF function prototype doesn't match the verifier types.
> 		 * Fall back to 5 u64 args.
> 		 */
> 		for (i = 0; i < 5; i++)
> 			m->arg_size[i] = 8;
> 		m->ret_size = 8;
> 		m->nr_args = 5;
> 		return 0;
> 	}
> 
> Shouldn't it be falling back to 6, not 5?

but looks like this actually could fall back to 6, the jit would
allow that, but I'm not sure if there's another restriction

thanks,
jirka

Steven Rostedt Oct. 23, 2020, 1:50 p.m. UTC | #7
On Fri, 23 Oct 2020 08:09:32 +0200
Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > The below is a quick proof of concept patch I whipped up. It will always
> > save 6 arguments, so if BPF is really interested in just saving the bare
> > minimum of arguments before calling, it can still use direct. But if you
> > are going to have a generic callback, you'll need to save all parameters,
> > otherwise you can corrupt the function's parameters if the traced function
> > uses more than you save.
> 
> nice, I'll take a look, thanks for quick code ;-)

Playing with it more, I have it where I don't add a new ARGS flag, but just
make that the default (if arch supports it). And with this change, I even
have function graph working directly with the function tracer, and I can
now get rid of the function graph trampoline! Of course, this will be a
slow process because it has to be changed across architectures, but with a
HAVE_FTRACE_ARGS flag, I can do it one by one.

> 
> > 
> > Also, looking at the bpf trampoline code, I noticed that if the verifier
> > can't find the btf func, it falls back to saving 5 parameters, which can be
> > a bug on x86 if the function itself uses 6 or more. If you only save 5
> > parameters and then call a bpf program that calls a helper function that uses
> > more than 5 parameters, it will likely corrupt the 6th parameter of the
> > function being traced.
> 
> number of args from eBPF program to in-kernel function is
> restricted to 5, so we should be fine

Is there something to keep an eBPF program from tracing a function with 6
args? If the program saves only 5 args, but traces a function that has 6
args, then the tracing program may end up using the register that the 6th
argument is in, and corrupting it.

I'm looking at bpf/trampoline.c, that has:

	arch_prepare_bpf_trampoline(new_image, ...)

and that new_image is passed into:

	register_ftrace_direct(ip, new_addr);

where new_addr == new_image.

And I don't see anywhere in the creation of that new_image that saves the
6th parameter.

The bpf program calls some helper functions which are allowed to clobber
%r9 (where the 6th parameter is stored on x86_64). That means, when it
returns to the function it traced, the 6th parameter is no longer correct.

At a minimum, direct callers must save all the parameters used by the
function, not just what the eBPF code may use.

> 
> > 
> > The code in question is this:
> > 
> > int btf_distill_func_proto(struct bpf_verifier_log *log,
> > 			   struct btf *btf,
> > 			   const struct btf_type *func,
> > 			   const char *tname,
> > 			   struct btf_func_model *m)
> > {
> > 	const struct btf_param *args;
> > 	const struct btf_type *t;
> > 	u32 i, nargs;
> > 	int ret;
> > 
> > 	if (!func) {
> > 		/* BTF function prototype doesn't match the verifier types.
> > 		 * Fall back to 5 u64 args.
> > 		 */
> > 		for (i = 0; i < 5; i++)
> > 			m->arg_size[i] = 8;
> > 		m->ret_size = 8;
> > 		m->nr_args = 5;
> > 		return 0;
> > 	}
> > 
> > Shouldn't it be falling back to 6, not 5?
> 
> but looks like this actually could fall back to 6, the jit would
> allow that, but I'm not sure if there's another restriction

Either way, the direct trampoline must save all registers used by
parameters of the function, and if it doesn't know how many parameters are
used, it must save all possible ones (like ftrace does).

-- Steve
Jiri Olsa Oct. 25, 2020, 7:01 p.m. UTC | #8
On Fri, Oct 23, 2020 at 09:50:20AM -0400, Steven Rostedt wrote:

SNIP

> Is there something to keep an eBPF program from tracing a function with 6
> args? If the program saves only 5 args, but traces a function that has 6
> args, then the tracing program may end up using the register that the 6th
> argument is in, and corrupting it.
> 
> I'm looking at bpf/trampoline.c, that has:
> 
> 	arch_prepare_bpf_trampoline(new_image, ...)
> 
> and that new_image is passed into:
> 
> 	register_ftrace_direct(ip, new_addr);
> 
> where new_addr == new_image.
> 
> And I don't see anywhere in the creation of that new_image that saves the
> 6th parameter.

  arch_prepare_bpf_trampoline
    ...
    save_regs(m, &prog, nr_args, stack_size);

> 
> The bpf program calls some helper functions which are allowed to clobber
> %r9 (where the 6th parameter is stored on x86_64). That means, when it
> returns to the function it traced, the 6th parameter is no longer correct.
> 
> At a minimum, direct callers must save all the parameters used by the
> function, not just what the eBPF code may use.
> 
> > 
> > > 
> > > The code in question is this:
> > > 
> > > int btf_distill_func_proto(struct bpf_verifier_log *log,
> > > 			   struct btf *btf,
> > > 			   const struct btf_type *func,
> > > 			   const char *tname,
> > > 			   struct btf_func_model *m)
> > > {
> > > 	const struct btf_param *args;
> > > 	const struct btf_type *t;
> > > 	u32 i, nargs;
> > > 	int ret;
> > > 
> > > 	if (!func) {
> > > 		/* BTF function prototype doesn't match the verifier types.
> > > 		 * Fall back to 5 u64 args.
> > > 		 */
> > > 		for (i = 0; i < 5; i++)
> > > 			m->arg_size[i] = 8;
> > > 		m->ret_size = 8;
> > > 		m->nr_args = 5;
> > > 		return 0;
> > > 	}

the fallback code in btf_distill_func_proto you're referring to
is for the case of tracing another ebpf program; when hooking to
a kernel function, all args are used with no fallback to 5 args

I'm not sure what the rules are wrt args count when tracing
another ebpf program

jirka
Alexei Starovoitov Oct. 27, 2020, 4:30 a.m. UTC | #9
On Thu, Oct 22, 2020 at 10:42:05AM -0400, Steven Rostedt wrote:
> On Thu, 22 Oct 2020 16:11:54 +0200
> Jiri Olsa <jolsa@redhat.com> wrote:
> 
> > I understand direct calls as a way that bpf trampolines and ftrace can
> > co-exist together - ebpf trampolines need that functionality of accessing
> > parameters of a function as if it was called directly and at the same
> > point we need to be able attach to any function and to as many functions
> > as we want in a fast way
> 
> I was sold that bpf needed a quick and fast way to get the arguments of a
> function, as the only way to do that with ftrace is to save all registers,
> which, I was told was too much overhead, as if you only care about
> arguments, there's much less that is needed to save.
> 
> Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> that for certain cases, bpf wanted a faster way to access arguments,
> because it still worked with ftrace, but the saving of regs was too
> strenuous.

Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
There is no other use for it.

Jiri,
could you please redo your benchmarking hardcoding ftrace_managed=false ?
If going through register_ftrace_direct() is indeed so much slower
than arch_text_poke() then something gotta give.
Either register_ftrace_direct() has to become faster or users
have to give up on co-existing of bpf and ftrace.
So far not a single user cared about using trampoline and ftrace together.
So the latter is certainly an option.

Regardless, patch 7 (rbtree of kallsyms) is probably good on its own.
Can you benchmark it independently and maybe resubmit if it's useful
without other patches?
Steven Rostedt Oct. 27, 2020, 1:14 p.m. UTC | #10
On Mon, 26 Oct 2020 21:30:14 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> > Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> > that for certain cases, bpf wanted a faster way to access arguments,
> > because it still worked with ftrace, but the saving of regs was too
> > strenuous.  
> 
> Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
> There is no other use for it.

What does that even mean? And I'm guessing when you say "trampoline"
you mean a "bpf trampoline" because "trampoline" is used for a lot more
than bpf, and bpf does not own that term.

Do you mean, "direct calls in ftrace were done so that bpf trampolines
could work". Remember, ftrace has a lot of users, and it must remain
backward compatible.

-- Steve
Jiri Olsa Oct. 27, 2020, 2:28 p.m. UTC | #11
On Mon, Oct 26, 2020 at 09:30:14PM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 22, 2020 at 10:42:05AM -0400, Steven Rostedt wrote:
> > On Thu, 22 Oct 2020 16:11:54 +0200
> > Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > > I understand direct calls as a way that bpf trampolines and ftrace can
> > > co-exist together - ebpf trampolines need that functionality of accessing
> > > parameters of a function as if it was called directly and at the same
> > > point we need to be able attach to any function and to as many functions
> > > as we want in a fast way
> > 
> > I was sold that bpf needed a quick and fast way to get the arguments of a
> > function, as the only way to do that with ftrace is to save all registers,
> > which, I was told was too much overhead, as if you only care about
> > arguments, there's much less that is needed to save.
> > 
> > Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> > that for certain cases, bpf wanted a faster way to access arguments,
> > because it still worked with ftrace, but the saving of regs was too
> > strenuous.
> 
> Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
> There is no other use for it.
> 
> Jiri,
> could you please redo your benchmarking hardcoding ftrace_managed=false ?
> If going through register_ftrace_direct() is indeed so much slower
> than arch_text_poke() then something gotta give.
> Either register_ftrace_direct() has to become faster or users
> have to give up on co-existing of bpf and ftrace.
> So far not a single user cared about using trampoline and ftrace together.
> So the latter is certainly an option.

I tried that, and IIRC it was not much faster, but I don't have details
on that.. but it should be quick check, I'll do it

anyway later I realized that for us we need ftrace to stay, so I abandoned
this idea ;-) and started to check on how to keep them both together and
just make it faster

also currently bpf trampolines will not work without ftrace being
enabled, because ftrace is doing the preparation work during compile,
and replaces all the fentry calls with nop instructions and the
replace code depends on those nops...  so if we go this way, we would
need to make this preparation code generic

> 
> Regardless, patch 7 (rbtree of kallsyms) is probably good on its own.
> Can you benchmark it independently and maybe resubmit if it's useful
> without other patches?

yes, I'll submit that in a separate patch

thanks,
jirka
Alexei Starovoitov Oct. 28, 2020, 9:13 p.m. UTC | #12
On Tue, Oct 27, 2020 at 03:28:03PM +0100, Jiri Olsa wrote:
> On Mon, Oct 26, 2020 at 09:30:14PM -0700, Alexei Starovoitov wrote:
> > On Thu, Oct 22, 2020 at 10:42:05AM -0400, Steven Rostedt wrote:
> > > On Thu, 22 Oct 2020 16:11:54 +0200
> > > Jiri Olsa <jolsa@redhat.com> wrote:
> > > 
> > > > I understand direct calls as a way that bpf trampolines and ftrace can
> > > > co-exist together - ebpf trampolines need that functionality of accessing
> > > > parameters of a function as if it was called directly and at the same
> > > > point we need to be able attach to any function and to as many functions
> > > > as we want in a fast way
> > > 
> > > I was sold that bpf needed a quick and fast way to get the arguments of a
> > > function, as the only way to do that with ftrace is to save all registers,
> > > which, I was told was too much overhead, as if you only care about
> > > arguments, there's much less that is needed to save.
> > > 
> > > Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> > > that for certain cases, bpf wanted a faster way to access arguments,
> > > because it still worked with ftrace, but the saving of regs was too
> > > strenuous.
> > 
> > Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
> > There is no other use for it.
> > 
> > Jiri,
> > could you please redo your benchmarking hardcoding ftrace_managed=false ?
> > If going through register_ftrace_direct() is indeed so much slower
> > than arch_text_poke() then something gotta give.
> > Either register_ftrace_direct() has to become faster or users
> > have to give up on co-existing of bpf and ftrace.
> > So far not a single user cared about using trampoline and ftrace together.
> > So the latter is certainly an option.
> 
> I tried that, and IIRC it was not much faster, but I don't have details
> on that.. but it should be quick check, I'll do it
> 
> anyway later I realized that for us we need ftrace to stay, so I abandoned
> this idea ;-) and started to check on how to keep them both together and
> just make it faster
> 
> also currently bpf trampolines will not work without ftrace being
> enabled, because ftrace is doing the preparation work during compile,
> and replaces all the fentry calls with nop instructions and the
> replace code depends on those nops...  so if we go this way, we would
> need to make this preparation code generic

I didn't mean that part.
I was talking about register_ftrace_direct() only.
Could you please still do the ftrace_managed=false experiment?
Sounds like the time to attach/detach will stay the same?
If so, then don't touch ftrace internals. What's the point?
Jiri Olsa Oct. 29, 2020, 11:09 a.m. UTC | #13
On Wed, Oct 28, 2020 at 02:13:25PM -0700, Alexei Starovoitov wrote:
> On Tue, Oct 27, 2020 at 03:28:03PM +0100, Jiri Olsa wrote:
> > On Mon, Oct 26, 2020 at 09:30:14PM -0700, Alexei Starovoitov wrote:
> > > On Thu, Oct 22, 2020 at 10:42:05AM -0400, Steven Rostedt wrote:
> > > > On Thu, 22 Oct 2020 16:11:54 +0200
> > > > Jiri Olsa <jolsa@redhat.com> wrote:
> > > > 
> > > > > I understand direct calls as a way that bpf trampolines and ftrace can
> > > > > co-exist together - ebpf trampolines need that functionality of accessing
> > > > > parameters of a function as if it was called directly and at the same
> > > > > point we need to be able attach to any function and to as many functions
> > > > > as we want in a fast way
> > > > 
> > > > I was sold that bpf needed a quick and fast way to get the arguments of a
> > > > function, as the only way to do that with ftrace is to save all registers,
> > > > which, I was told was too much overhead, as if you only care about
> > > > arguments, there's much less that is needed to save.
> > > > 
> > > > Direct calls wasn't added so that bpf and ftrace could co-exist, it was
> > > > that for certain cases, bpf wanted a faster way to access arguments,
> > > > because it still worked with ftrace, but the saving of regs was too
> > > > strenuous.
> > > 
> > > Direct calls in ftrace were done so that ftrace and trampoline can co-exist.
> > > There is no other use for it.
> > > 
> > > Jiri,
> > > could you please redo your benchmarking hardcoding ftrace_managed=false ?
> > > If going through register_ftrace_direct() is indeed so much slower
> > > than arch_text_poke() then something gotta give.
> > > Either register_ftrace_direct() has to become faster or users
> > > have to give up on co-existing of bpf and ftrace.
> > > So far not a single user cared about using trampoline and ftrace together.
> > > So the latter is certainly an option.
> > 
> > I tried that, and IIRC it was not much faster, but I don't have details
> > on that.. but it should be quick check, I'll do it
> > 
> > anyway later I realized that for us we need ftrace to stay, so I abandoned
> > this idea ;-) and started to check on how to keep them both together and
> > just make it faster
> > 
> > also currently bpf trampolines will not work without ftrace being
> > enabled, because ftrace is doing the preparation work during compile,
> > and replaces all the fentry calls with nop instructions and the
> > replace code depends on those nops...  so if we go this way, we would
> > need to make this preparation code generic
> 
> I didn't mean that part.
> I was talking about register_ftrace_direct() only.
> Could you please still do the ftrace_managed=false experiment?
> Sounds like the time to attach/detach will stay the same?
> If so, then don't touch ftrace internals. What's the point?

actually, there's some speedup.. by running:

  # perf stat --table -e cycles:k,cycles:u -r 10 ./src/bpftrace -ve 'kfunc:__x64_sys_s* { } i:ms:10 { print("exit\n"); exit();}'

I've got following numbers on base:

     3,463,157,566      cycles:k                                                      ( +-  0.14% )
     1,164,026,270      cycles:u                                                      ( +-  0.29% )

             # Table of individual measurements:
             37.61 (-12.20) #######
             49.35 (-0.46) #
             54.03 (+4.22) ##
             50.82 (+1.01) #
             46.87 (-2.94) ##
             53.10 (+3.29) ##
             58.27 (+8.46) ###
             64.85 (+15.04) #####
             47.37 (-2.44) ##
             35.83 (-13.98) ########

             # Final result:
             49.81 +- 2.76 seconds time elapsed  ( +-  5.54% )


and following numbers with the patch below:

     2,037,364,413      cycles:k        ( +-  0.52% )
     1,164,769,939      cycles:u        ( +-  0.19% )

             # Table of individual measurements:
             30.52 (-8.54) ######
             43.43 (+4.37) ###
             43.72 (+4.66) ###
             35.70 (-3.36) ##
             40.70 (+1.63) #
             43.51 (+4.44) ###
             26.44 (-12.62) ##########
             40.21 (+1.14) #
             43.32 (+4.25) ##
             43.09 (+4.03) ##

             # Final result:
             39.06 +- 1.95 seconds time elapsed  ( +-  4.99% )


it looks like even ftrace_managed=false could be faster
with batch update, which is not used, but there's support
for it via text_poke_bp_batch function

jirka


---
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 35c5887d82ff..0a241e6eac7d 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -111,6 +111,8 @@ static int is_ftrace_location(void *ip)
 {
 	long addr;
 
+	return 0;
+
 	addr = ftrace_location((long)ip);
 	if (!addr)
 		return 0;
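
For reference, a minimal sketch of how such a batch update could look on
x86 using the existing text_poke_queue()/text_poke_finish() helpers, which
feed text_poke_bp_batch() internally (poke_call_sites() and its arguments
are hypothetical, for illustration only):

#include <asm/text-patching.h>

static void poke_call_sites(void **addrs, const char (*insns)[5], int cnt)
{
	int i;

	/* queue every 5-byte call/nop rewrite instead of patching one by one */
	for (i = 0; i < cnt; i++)
		text_poke_queue(addrs[i], insns[i], 5, NULL);

	/* a single int3 breakpoint patching cycle then covers the whole batch */
	text_poke_finish();
}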