
[bpf-next,v5,1/6] bpf/helpers: introduce sleepable bpf_timers

Message ID 20240322-hid-bpf-sleepable-v5-1-179c7b59eaaa@kernel.org
State New
Series sleepable bpf_timer (was: allow HID-BPF to do device IOs)

Commit Message

Benjamin Tissoires March 22, 2024, 2:56 p.m. UTC
They are implemented as a workqueue, which means that there are no
guarantees of timing nor ordering.

Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>

---

no changes in v5

changes in v4:
- dropped __bpf_timer_compute_key()
- use a spin_lock instead of a semaphore
- ensure bpf_timer_cancel_and_free is not complaining about
  non sleepable context and use cancel_work() instead of
  cancel_work_sync()
- return -EINVAL if a delay is given to bpf_timer_start() with
  BPF_F_TIMER_SLEEPABLE

changes in v3:
- extracted the implementation in bpf_timer only, without
  bpf_timer_set_sleepable_cb()
- rely on schedule_work() only, from bpf_timer_start()
- add semaphore to ensure bpf_timer_work_cb() is accessing
  consistent data

changes in v2 (compared to the one attached to v1 0/9):
- make use of a kfunc
- add a (non-used) BPF_F_TIMER_SLEEPABLE
- the callback is *not* called, it makes the kernel crash
---
 include/uapi/linux/bpf.h |  4 +++
 kernel/bpf/helpers.c     | 86 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 88 insertions(+), 2 deletions(-)

Comments

Benjamin Tissoires March 27, 2024, 5:02 p.m. UTC | #1
On Mon, Mar 25, 2024 at 1:50 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Mar 22, 2024 at 7:56 AM Benjamin Tissoires <bentiss@kernel.org> wrote:
> >
> > They are implemented as a workqueue, which means that there are no
> > guarantees of timing nor ordering.
> >
> > Signed-off-by: Benjamin Tissoires <bentiss@kernel.org>
> >
> > ---
> >
> > no changes in v5
> >
> > changes in v4:
> > - dropped __bpf_timer_compute_key()
> > - use a spin_lock instead of a semaphore
> > - ensure bpf_timer_cancel_and_free is not complaining about
> >   non sleepable context and use cancel_work() instead of
> >   cancel_work_sync()
> > - return -EINVAL if a delay is given to bpf_timer_start() with
> >   BPF_F_TIMER_SLEEPABLE
> >
> > changes in v3:
> > - extracted the implementation in bpf_timer only, without
> >   bpf_timer_set_sleepable_cb()
> > - rely on schedule_work() only, from bpf_timer_start()
> > - add semaphore to ensure bpf_timer_work_cb() is accessing
> >   consistent data
> >
> > changes in v2 (compared to the one attaches to v1 0/9):
> > - make use of a kfunc
> > - add a (non-used) BPF_F_TIMER_SLEEPABLE
> > - the callback is *not* called, it makes the kernel crashes
> > ---
> >  include/uapi/linux/bpf.h |  4 +++
> >  kernel/bpf/helpers.c     | 86 ++++++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 88 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 3c42b9f1bada..b90def29d796 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -7461,10 +7461,14 @@ struct bpf_core_relo {
> >   *     - BPF_F_TIMER_ABS: Timeout passed is absolute time, by default it is
> >   *       relative to current time.
> >   *     - BPF_F_TIMER_CPU_PIN: Timer will be pinned to the CPU of the caller.
> > + *     - BPF_F_TIMER_SLEEPABLE: Timer will run in a sleepable context, with
> > + *       no guarantees of ordering nor timing (consider this as being just
> > + *       offloaded immediately).
> >   */
> >  enum {
> >         BPF_F_TIMER_ABS = (1ULL << 0),
> >         BPF_F_TIMER_CPU_PIN = (1ULL << 1),
> > +       BPF_F_TIMER_SLEEPABLE = (1ULL << 2),
> >  };
> >
> >  /* BPF numbers iterator state */
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index a89587859571..38de73a9df83 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1094,14 +1094,20 @@ const struct bpf_func_proto bpf_snprintf_proto = {
> >   * bpf_timer_cancel() cancels the timer and decrements prog's refcnt.
> >   * Inner maps can contain bpf timers as well. ops->map_release_uref is
> >   * freeing the timers when inner map is replaced or deleted by user space.
> > + *
> > + * sleepable_lock protects only the setup of the workqueue, not the callback
> > + * itself. This is done to ensure we don't run concurrently a free of the
> > + * callback or the associated program.
>
> I recall there was a discussion about this lock earlier,
> but I don't remember what the conclusion was.

There wasn't much of a conclusion TBH.

> The above comment is not enough to understand what it protects.
>
> In general how sleepable cb is fundamentally different
> from non-sleepable one when it comes to races ?

I think there are 2 main differences:
- it's sleepable, so classic RCU locking doesn't work (I didn't know
about rcu_read_lock_trace() up to now)
- when we cancel(_and_free) the program, we can not afford to wait for
the program to finish because that program might take ages to do so.

While OTOH, hrtimer callbacks are in soft IRQ, so with IRQ disabled,
and nothing can interrupt it AFAICT. We can also wait for the timer cb
to finish in that case because it can't sleep.

>
> bpf_timer_set_callback() is racy for both sleepable and non-sleepable
> and the latter handles it fine.

I don't think bpf_timer_set_callback() is the problem: in both cases
(sleepable or not) we are under the spinlock from bpf_timer_kern so
the race is cut short there.

>
> Note that struct bpf_hrtimer is rcu protected.
> See kfree_rcu(t, rcu); in bpf_timer_cancel_and_free().

Sorry, RCU is still always hard to grasp for me, and even if I think I
get it, I still don't see how this would be sufficient in sleepable
bpf_timer_work_cb() without any lock protecting the struct bpf_hrtimer
very first access.

>
> >   */
> >  struct bpf_hrtimer {
> >         struct hrtimer timer;
> > +       struct work_struct work;
> >         struct bpf_map *map;
> >         struct bpf_prog *prog;
> >         void __rcu *callback_fn;
> >         void *value;
> >         struct rcu_head rcu;
> > +       spinlock_t sleepable_lock;
> >  };
> >
> >  /* the actual struct hidden inside uapi struct bpf_timer */
> > @@ -1114,6 +1120,49 @@ struct bpf_timer_kern {
> >         struct bpf_spin_lock lock;
> >  } __attribute__((aligned(8)));
> >
> > +static void bpf_timer_work_cb(struct work_struct *work)
> > +{
> > +       struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
> > +       struct bpf_map *map = t->map;
> > +       bpf_callback_t callback_fn;
> > +       void *value = t->value;
> > +       unsigned long flags;
> > +       void *key;
> > +       u32 idx;
> > +
> > +       BTF_TYPE_EMIT(struct bpf_timer);
> > +
> > +       spin_lock_irqsave(&t->sleepable_lock, flags);
> > +
> > +       callback_fn = READ_ONCE(t->callback_fn);
> > +       if (!callback_fn) {
> > +               spin_unlock_irqrestore(&t->sleepable_lock, flags);
> > +               return;
> > +       }
> > +
> > +       if (map->map_type == BPF_MAP_TYPE_ARRAY) {
> > +               struct bpf_array *array = container_of(map, struct bpf_array, map);
> > +
> > +               /* compute the key */
> > +               idx = ((char *)value - array->value) / array->elem_size;
> > +               key = &idx;
> > +       } else { /* hash or lru */
> > +               key = value - round_up(map->key_size, 8);
> > +       }
> > +
> > +       /* prevent the callback to be freed by bpf_timer_cancel() while running
> > +        * so we can release the sleepable lock
> > +        */
> > +       bpf_prog_inc(t->prog);
> > +
> > +       spin_unlock_irqrestore(&t->sleepable_lock, flags);
>
> why prog_inc ?
> The sleepable progs need rcu_read_lock_trace() + migrate_disable()
> anyway, which are missing here.
> Probably best to call __bpf_prog_enter_sleepable_recur()
> like kern_sys_bpf() does.

Sounds like a good idea.

But as I was playing with it, I realized that t->prog is not RCU
protected, so I have no guarantees that the value is correct while
calling __bpf_prog_enter_sleepable_recur(t->prog, &run_ctx)...

Should I manually call first rcu_read_lock_trace() before
__bpf_prog_enter_sleepable_recur(t->prog, &run_ctx)?

>
> Now with that, the bpf_timer_cancel() can drop prog refcnt to zero
> and it's ok, since rcu_read_lock_trace() will protect it.

OK, this is a good step forward, thanks!

>
> > +
> > +       callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
> > +       /* The verifier checked that return value is zero. */
>
> the prog will finish and will be freed after rcu_read_unlock_trace().
> Seems fine to me. No need for inc/dec refcnt.

Ack

>
> > +
> > +       bpf_prog_put(t->prog);
> > +}
> > +
> >  static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
> >
> >  static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
> > @@ -1192,6 +1241,8 @@ BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
> >         t->prog = NULL;
> >         rcu_assign_pointer(t->callback_fn, NULL);
> >         hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
> > +       INIT_WORK(&t->work, bpf_timer_work_cb);
> > +       spin_lock_init(&t->sleepable_lock);
> >         t->timer.function = bpf_timer_cb;
> >         WRITE_ONCE(timer->timer, t);
> >         /* Guarantee the order between timer->timer and map->usercnt. So
> > @@ -1237,6 +1288,7 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
> >                 ret = -EINVAL;
> >                 goto out;
> >         }
> > +       spin_lock(&t->sleepable_lock);
> >         if (!atomic64_read(&t->map->usercnt)) {
> >                 /* maps with timers must be either held by user space
> >                  * or pinned in bpffs. Otherwise timer might still be
> > @@ -1263,6 +1315,8 @@ BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
> >         }
> >         rcu_assign_pointer(t->callback_fn, callback_fn);
> >  out:
> > +       if (t)
> > +               spin_unlock(&t->sleepable_lock);
> >         __bpf_spin_unlock_irqrestore(&timer->lock);
>
> If lock is really needed why timer->lock cannot be reused?
> The pattern of two locks in pretty much the same data structure
> is begging for questions about what is going on here.

Agree, but I can't find a way to reuse timer->lock:
- ideally I should add struct work_struct into struct bpf_timer_kern
directly, but there is a warning about the size of bpf_timer_kern
which makes me feel like we can not extend it
- adding a pointer back from struct bpf_hrtimer to bpf_timer_kern is
also not a solution as we might be freed if we are outside of the lock
in bpf_timer_kern...

Though if I have reliable access from bpf_timer_work_cb() to the
matching bpf_timer_kern, I could spinlock ->lock while I need to
access ->timer, and then everything would be much easier.

>
> >         return ret;
> >  }
> > @@ -1283,8 +1337,12 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
> >
> >         if (in_nmi())
> >                 return -EOPNOTSUPP;
> > -       if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
> > +       if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN | BPF_F_TIMER_SLEEPABLE))
> >                 return -EINVAL;
> > +
> > +       if ((flags & BPF_F_TIMER_SLEEPABLE) && nsecs)
> > +               return -EINVAL;
> > +
> >         __bpf_spin_lock_irqsave(&timer->lock);
> >         t = timer->timer;
> >         if (!t || !t->prog) {
> > @@ -1300,7 +1358,10 @@ BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
> >         if (flags & BPF_F_TIMER_CPU_PIN)
> >                 mode |= HRTIMER_MODE_PINNED;
> >
> > -       hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> > +       if (flags & BPF_F_TIMER_SLEEPABLE)
> > +               schedule_work(&t->work);
> > +       else
> > +               hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
> >  out:
> >         __bpf_spin_unlock_irqrestore(&timer->lock);
> >         return ret;
> > @@ -1348,13 +1409,22 @@ BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
> >                 ret = -EDEADLK;
> >                 goto out;
> >         }
> > +       spin_lock(&t->sleepable_lock);
> >         drop_prog_refcnt(t);
> > +       spin_unlock(&t->sleepable_lock);
>
> this also looks odd.

I basically need to protect "t->prog = NULL;" from happening while
bpf_timer_work_cb is setting up the bpf program to be run.

>
> >  out:
> >         __bpf_spin_unlock_irqrestore(&timer->lock);
> >         /* Cancel the timer and wait for associated callback to finish
> >          * if it was running.
> >          */
> >         ret = ret ?: hrtimer_cancel(&t->timer);
> > +
> > +       /* also cancel the sleepable work, but *do not* wait for
> > +        * it to finish if it was running as we might not be in a
> > +        * sleepable context
> > +        */
> > +       ret = ret ?: cancel_work(&t->work);
> > +
> >         rcu_read_unlock();
> >         return ret;
> >  }
> > @@ -1383,11 +1453,13 @@ void bpf_timer_cancel_and_free(void *val)
> >         t = timer->timer;
> >         if (!t)
> >                 goto out;
> > +       spin_lock(&t->sleepable_lock);
> >         drop_prog_refcnt(t);
> >         /* The subsequent bpf_timer_start/cancel() helpers won't be able to use
> >          * this timer, since it won't be initialized.
> >          */
> >         WRITE_ONCE(timer->timer, NULL);
> > +       spin_unlock(&t->sleepable_lock);
>
> This one I don't understand either.

Same as above, I do not want t->prog to be set to NULL.

Also, side note: if anyone feels like it would go faster to fix those
races themselves instead of teaching me how to properly do it, this
is definitely fine by me :)

Cheers,
Benjamin
Alexei Starovoitov April 3, 2024, 6:50 p.m. UTC | #2
On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
<benjamin.tissoires@redhat.com> wrote:
> > >                 goto out;
> > >         }
> > > +       spin_lock(&t->sleepable_lock);
> > >         drop_prog_refcnt(t);
> > > +       spin_unlock(&t->sleepable_lock);
> >
> > this also looks odd.
>
> I basically need to protect "t->prog = NULL;" from happening while
> bpf_timer_work_cb is setting up the bpf program to be run.

Ok. I think I understand the race you're trying to fix.
The bpf_timer_cancel_and_free() is doing
cancel_work()
and proceeds with
kfree_rcu(t, rcu);

That's the only race and these extra locks don't help.

The t->prog = NULL is nothing to worry about.
The bpf_timer_work_cb() might still see callback_fn == NULL
"when it's being setup" and it's ok.
These locks don't help that.

I suggest to drop sleepable_lock everywhere.
READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
Add rcu_read_lock_trace() before calling bpf prog.

The race to fix is above 'cancel_work + kfree_rcu'
since kfree_rcu might free 'struct bpf_hrtimer *t'
while the work is pending and work_queue internal
logic might UAF struct work_struct work.
By the time it may luckily enter bpf_timer_work_cb() it's too late.
The argument 'struct work_struct *work' might already be freed.
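
A toy, single-threaded replay of that window may help. It is only a
sketch: every toy_* name is hypothetical, only the PENDING bit of a work
item is modelled, and 'freed' is a flag rather than a real kfree so
nothing actually faults. It assumes the worker has already dequeued the
item by the time the cancel runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct work_struct: only the PENDING bit is modelled. */
struct toy_work {
	bool pending;
};

/* Toy stand-in for struct bpf_hrtimer; 'freed' marks a simulated free. */
struct toy_race_timer {
	struct toy_work work;
	bool freed;
};

/* Models schedule_work(): mark the item pending. */
void toy_schedule_work(struct toy_work *w)
{
	w->pending = true;
}

/* Models cancel_work(): returns true only if the item was still pending.
 * It never waits for a callback that was already dequeued.
 */
bool toy_cancel_work(struct toy_work *w)
{
	bool was_pending = w->pending;

	w->pending = false;
	return was_pending;
}

/* Worker, step 1: dequeue the item and clear PENDING. */
void toy_worker_dequeue(struct toy_work *w)
{
	w->pending = false;
}

/* Worker, step 2: the callback would dereference the timer here.
 * Returns true if that access would land on freed memory.
 */
bool toy_callback_hits_freed(const struct toy_race_timer *t)
{
	return t->freed;
}

/* Replays: worker dequeues, canceller cancels and frees, worker runs the
 * callback. Returns true if the callback would touch freed memory.
 */
bool toy_replay_race(bool *cancel_ret)
{
	struct toy_race_timer t = { .work = { .pending = false }, .freed = false };

	assert(cancel_ret != NULL);
	toy_schedule_work(&t.work);
	toy_worker_dequeue(&t.work);            /* worker side */
	*cancel_ret = toy_cancel_work(&t.work); /* canceller side */
	t.freed = true;                         /* stands in for kfree_rcu() completing */
	return toy_callback_hits_freed(&t);     /* worker side */
}
```

In this model the cancel reports "nothing was pending" and the callback
still runs afterwards, which is why the return value alone cannot tell
the canceller that freeing is safe.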

To fix this problem, how about the following:
don't call kfree_rcu and instead queue the work to free it.
After cancel_work(&t->work); the work_struct can be reused.
So set it up to call "freeing callback" and do
schedule_work(&t->work);

There is a big assumption here that new work won't be
executed before cancelled work completes.
Need to check with wq experts.

Another approach is to do something smart with
cancel_work() return code.
If it returns true set a flag inside bpf_hrtimer and
make bpf_timer_work_cb() free(t) after bpf prog finishes.

> Also, side note: if anyone feels like it would go faster to fix those
> races by themself instead of teaching me how to properly do it, this
> is definitely fine from me :)

Most of the time goes into analyzing and thinking :)
Whoever codes it doesn't speed things much.
Pls do another respin if you still have cycles to work on it.
Benjamin Tissoires April 4, 2024, 3:26 p.m. UTC | #3
On Thu, Apr 4, 2024 at 4:44 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Apr 3, 2024 at 6:01 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
> > > <benjamin.tissoires@redhat.com> wrote:
> > > > > >                 goto out;
> > > > > >         }
> > > > > > +       spin_lock(&t->sleepable_lock);
> > > > > >         drop_prog_refcnt(t);
> > > > > > +       spin_unlock(&t->sleepable_lock);
> > > > >
> > > > > this also looks odd.
> > > >
> > > > I basically need to protect "t->prog = NULL;" from happening while
> > > > bpf_timer_work_cb is setting up the bpf program to be run.
> > >
> > > Ok. I think I understand the race you're trying to fix.
> > > The bpf_timer_cancel_and_free() is doing
> > > cancel_work()
> > > and proceeds with
> > > kfree_rcu(t, rcu);
> > >
> > > That's the only race and these extra locks don't help.


Thanks a lot for pinpointing the location of the race. Indeed, when I
read your email this morning I said "of course, this was obvious" :(

>
> > >
> > > The t->prog = NULL is nothing to worry about.
> > > The bpf_timer_work_cb() might still see callback_fn == NULL
> > > "when it's being setup" and it's ok.
> > > These locks don't help that.
> > >
> > > I suggest to drop sleepable_lock everywhere.
> > > READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
> > > Add rcu_read_lock_trace() before calling bpf prog.
> > >
> > > The race to fix is above 'cancel_work + kfree_rcu'
> > > since kfree_rcu might free 'struct bpf_hrtimer *t'
> > > while the work is pending and work_queue internal
> > > logic might UAF struct work_struct work.
> > > By the time it may luckily enter bpf_timer_work_cb() it's too late.
> > > The argument 'struct work_struct *work' might already be freed.
> > >
> > > To fix this problem, how about the following:
> > > don't call kfree_rcu and instead queue the work to free it.
> > > After cancel_work(&t->work); the work_struct can be reused.
> > > So set it up to call "freeing callback" and do
> > > schedule_work(&t->work);
> > >
> > > There is a big assumption here that new work won't be
> > > executed before cancelled work completes.
> > > Need to check with wq experts.
> > >
> > > Another approach is to do something smart with
> > > cancel_work() return code.
> > > If it returns true set a flag inside bpf_hrtimer and
> > > make bpf_timer_work_cb() free(t) after bpf prog finishes.
> >
> > Looking through wq code... I think I have to correct myself.
> > cancel_work and immediate free is probably fine from wq pov.
> > It has this comment:
> >         worker->current_func(work);
> >         /*
> >          * While we must be careful to not use "work" after this, the trace
> >          * point will only record its address.
> >          */
> >         trace_workqueue_execute_end(work, worker->current_func);
> >
> > the bpf_timer_work_cb() might still be running bpf prog.
> > So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns,
> > since kfree_rcu(t, rcu); could have freed it by then.
> > There is also this code in net/rxrpc/rxperf.c
> >         cancel_work(&call->work);
> >         kfree(call);
>
> Correction to correction.
> Above piece in rxrpc is buggy.
> The following race is possible:
> cpu 0
> process_one_work()
> set_work_pool_and_clear_pending(work, pool->id, 0);
>
>     cpu 1
>     cancel_work()
>     kfree_rcu(work)
>
> worker->current_func(work);
>
> Here 'work' is a pointer to freed memory.
> Though wq code will not be touching it, callback will UAF.
>
> Also what I proposed earlier as:
> INIT_WORK(A); schedule_work(); cancel_work(); INIT_WORK(B); schedule_work();
> won't guarantee the ordering.
> Since the callback function is different,
> find_worker_executing_work() will consider it a separate work item.
>
> Another option is to keep bpf_timer_work_cb callback
> and add a 'bool free_me;' to struct bpf_hrtimer
> and let the callback free it.
> But it's also racy.
> cancel_work() may return false, though worker->current_func(work)
> wasn't called yet.
> So we cannot set 'free_me' in bpf_timer_cancel_and_free()
> in a race-free manner.
>
> After brainstorming with Tejun it seems the best is to use
> another work_struct to call a different callback and do
> cancel_work_sync() there.

Works for me. I should be able to spin a v6 soon enough, but I have a
couple of remaining questions below:

>
> So we need something like:
>
> struct bpf_hrtimer {
>   union {
>     struct hrtimer timer;
> +   struct work_struct work;
>   };
>   struct bpf_map *map;
>   struct bpf_prog *prog;
>   void __rcu *callback_fn;
>   void *value;
>   union {

Are you sure we need a union here? If we get to call kfree_rcu() we
need to have both struct rcu_head and sync_work, not one or the other.

>     struct rcu_head rcu;
> +   struct work_struct sync_work;
>   };
> + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE

If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
(like in my v2 or v3 IIRC). But that means that once a timer is
initialized it needs to be of one or the other type (especially true
with the first union in this struct).

So should we reject bpf_timer_set_callback() at run time for
sleepable timers and only allow bpf_timer_set_sleepable_cb() for
those? (and the inverse in the other case).

This version of the patch allows for one timer to be used as softIRQ
or WQ, depending on the timer_set_callback that is used. But it might
be simpler for the kfree_rcu race to define the bpf_timer to be of one
kind, so we are sure to call the correct kfree method.

>  };
>
> 'work' will be used to call bpf_timer_work_cb.
> 'sync_work' will be used to call cancel_work_sync() + kfree_rcu().
>
> And, of course,
> schedule_work(&t->sync_work); from bpf_timer_cancel_and_free()
> instead of kfree_rcu.
>

Cheers,
Benjamin
Alexei Starovoitov April 4, 2024, 4:40 p.m. UTC | #4
On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
<benjamin.tissoires@redhat.com> wrote:
>
>
> >
> > So we need something like:
> >
> > struct bpf_hrtimer {
> >   union {
> >     struct hrtimer timer;
> > +   struct work_struct work;
> >   };
> >   struct bpf_map *map;
> >   struct bpf_prog *prog;
> >   void __rcu *callback_fn;
> >   void *value;
> >   union {
>
> Are you sure we need an union here? If we get to call kfree_rcu() we
> need to have both struct rcu_head and sync_work, not one or the other.

why? with an extra flag it's one or the other.
In bpf_timer_cancel_and_free()
if (flag & SLEEPABLE) {
    schedule_work() to cancel_work_sync + kfree_rcu
} else {
   hrtimer_cancel
   kfree_rcu
}
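
Spelled out as a compilable userspace sketch, the branch could look
like the following. This is a model only: the toy_* names are
hypothetical, the steps are merely recorded rather than executed, and
the deferred work is called inline so the example stays
single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_F_SLEEPABLE (1ULL << 2) /* mirrors BPF_F_TIMER_SLEEPABLE from the patch */

enum toy_step {
	TOY_SCHEDULE_SYNC_WORK,
	TOY_CANCEL_WORK_SYNC,
	TOY_HRTIMER_CANCEL,
	TOY_KFREE_RCU,
};

struct toy_log {
	enum toy_step steps[4];
	size_t n;
};

/* Append one step to the log; the log is sized for the longest path. */
static void toy_record(struct toy_log *log, enum toy_step s)
{
	assert(log->n < sizeof(log->steps) / sizeof(log->steps[0]));
	log->steps[log->n++] = s;
}

/* Body of the deferred sync_work: process context, so waiting is allowed. */
void toy_sync_work_fn(struct toy_log *log)
{
	toy_record(log, TOY_CANCEL_WORK_SYNC);
	toy_record(log, TOY_KFREE_RCU);
}

/* Teardown entry point, possibly called from a context that cannot sleep. */
void toy_cancel_and_free(uint64_t flags, struct toy_log *log)
{
	if (flags & TOY_F_SLEEPABLE) {
		toy_record(log, TOY_SCHEDULE_SYNC_WORK);
		/* A real worker would run this later; inline here for the model. */
		toy_sync_work_fn(log);
	} else {
		toy_record(log, TOY_HRTIMER_CANCEL);
		toy_record(log, TOY_KFREE_RCU);
	}
}
```

Either way kfree_rcu is the last step; the only difference is who
performs the waiting cancel before it.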

> >     struct rcu_head rcu;
> > +   struct work_struct sync_work;
> >   };
> > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
>
> If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> (like in my v2 or v3 IIRC). But that means that once a timer is
> initialized it needs to be of one or the other type (especially true
> with the first union in this struct).

yes. That's an idea.
The code to support wq vs timer seems to be diverging more
than what we expected initially.
It seems cleaner to set it at init time and enforce it in
other helpers.

Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
too far.
It's already at 112 bytes and some people use bpf_timer per flow.
So potentially millions of such timers.
Adding extra sizeof(struct work_struct)=32 * 2 that won't be
used is too much.
Note that sizeof(struct hrtimer)=64, so unions make everything
fit nicely.
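
The arithmetic can be checked with opaque stand-ins sized to the
numbers above. This is a sketch under assumptions: a 64-bit build,
pointer fields modelled as uint64_t, and struct rcu_head taken to be
16 bytes (that figure is not stated in this thread). The toy_* types
are hypothetical and not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins sized to the figures quoted above. */
struct toy_hrtimer     { uint64_t opaque[8]; }; /* 64 bytes */
struct toy_work_struct { uint64_t opaque[4]; }; /* 32 bytes */
struct toy_rcu_head    { uint64_t opaque[2]; }; /* 16 bytes, assumed */

/* Layout before this series: hrtimer, four pointers, rcu_head. */
struct toy_timer_today {
	struct toy_hrtimer timer;
	uint64_t map, prog, callback_fn, value; /* pointer-sized slots */
	struct toy_rcu_head rcu;
};

/* Both work_structs and the flags appended without any union. */
struct toy_timer_flat {
	struct toy_hrtimer timer;
	struct toy_work_struct work;
	uint64_t map, prog, callback_fn, value;
	struct toy_rcu_head rcu;
	struct toy_work_struct sync_work;
	uint64_t flags;
};

/* Proposed layout: timer/work share storage, rcu/sync_work share storage. */
struct toy_timer_union {
	union {
		struct toy_hrtimer timer;
		struct toy_work_struct work;
	};
	uint64_t map, prog, callback_fn, value;
	union {
		struct toy_rcu_head rcu;
		struct toy_work_struct sync_work;
	};
	uint64_t flags;
};

static_assert(sizeof(struct toy_timer_today) == 112, "64 + 4*8 + 16");
static_assert(sizeof(struct toy_timer_flat) == 184, "adds 32 + 32 + 8");
static_assert(sizeof(struct toy_timer_union) == 136, "64 + 4*8 + 32 + 8");
```

With these stand-ins the union layout grows by 24 bytes over the
current one, against 72 bytes for the flat layout.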

> So should we reject during run time bpf_timer_set_callback() for
> sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> those? (and the invert in the other case).

yes.

> This version of the patch allows for one timer to be used as softIRQ
> or WQ, depending on the timer_set_callback that is used. But it might
> be simpler for the kfree_rcu race to define the bpf_timer to be of one
> kind, so we are sure to call the correct kfree method.

I think one or another simplifies the code and makes it easier
to think through combinations.

I'm still contemplating adding new "struct bpf_wq" and new kfuncs
to completely separate wq vs timer.
The code reuse seems to be relatively small.
We can potentially factor out internals of bpf_timer_* into smaller
helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.

One more thing.
bpf_timer_cancel() api turned out to be troublesome.
Not only does it cancel the timer, it also drops the callback.
It was a surprising behavior for people familiar with
kernel timer api-s.
We should not repeat this mistake with wq.

We can try to fix bpf_timer_cancel() too.
If we remove drop_prog_refcnt() from it, it shouldn't affect
existing bpf_timer users who are forced to do:
bpf_timer_cancel()
bpf_timer_set_callback()
bpf_timer_start()
all the time.
If/when bpf_timer_cancel() stops dropping the callback
such bpf prog won't be affected. So low chance of breaking any prog.
wdyt?
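
To make the compatibility argument concrete, here is a small userspace
model of the two cancel behaviours. It is a sketch, not kernel code:
the toy_* names are hypothetical, and toy_start() refusing to arm
without a callback is meant to mirror the !t->prog check in
bpf_timer_start().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

typedef int (*toy_cb_t)(void);

struct toy_cb_timer {
	toy_cb_t callback; /* NULL means no callback is set */
	bool armed;
};

/* Models bpf_timer_set_callback(): remember the callback. */
int toy_set_callback(struct toy_cb_timer *t, toy_cb_t cb)
{
	assert(t != NULL);
	t->callback = cb;
	return 0;
}

/* Models bpf_timer_start(): refuses to arm a timer without a callback. */
int toy_start(struct toy_cb_timer *t)
{
	assert(t != NULL);
	if (!t->callback)
		return -EINVAL;
	t->armed = true;
	return 0;
}

/* drop_callback=true models the current bpf_timer_cancel() behaviour,
 * drop_callback=false models the change proposed above.
 */
void toy_cancel(struct toy_cb_timer *t, bool drop_callback)
{
	assert(t != NULL);
	t->armed = false;
	if (drop_callback)
		t->callback = NULL;
}

/* Trivial callback used only to have something to register. */
int toy_noop_cb(void)
{
	return 0;
}
```

A program that always runs cancel, set_callback, start in that order
succeeds under both settings of drop_callback; only a program that
skips set_callback would observe the difference.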
Benjamin Tissoires April 4, 2024, 5:56 p.m. UTC | #5
On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
> <benjamin.tissoires@redhat.com> wrote:
> >
> >
> > >
> > > So we need something like:
> > >
> > > struct bpf_hrtimer {
> > >   union {
> > >     struct hrtimer timer;
> > > +   struct work_struct work;
> > >   };
> > >   struct bpf_map *map;
> > >   struct bpf_prog *prog;
> > >   void __rcu *callback_fn;
> > >   void *value;
> > >   union {
> >
> > Are you sure we need an union here? If we get to call kfree_rcu() we
> > need to have both struct rcu_head and sync_work, not one or the other.
>
> why? with an extra flag it's one or the other.
> In bpf_timer_cancel_and_free()
> if (flag & SLEEPABLE) {
>     schedule_work() to cancel_work_sync + kfree_rcu
> } else {
>    hrtimer_cancel
>    kfree_rcu
> }

I thought kfree_rcu required struct rcu_head, and given that we need
to initialize sync_work it will be poisoned...

>
> > >     struct rcu_head rcu;
> > > +   struct work_struct sync_work;
> > >   };
> > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
> >
> > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> > (like in my v2 or v3 IIRC). But that means that once a timer is
> > initialized it needs to be of one or the other type (especially true
> > with the first union in this struct).
>
> yes. That's an idea.
> The code to support wq vs timer seems to be diverging more
> than what we expected initially.
> It seems cleaner to set it as init time and enforce in
> other helpers.

OK, works for me.

>
> Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
> too far.
> It's already at 112 bytes and some people use bpf_timer per flow.
> So potentially millions of such timers.
> Adding extra sizeof(struct work_struct)=32 * 2 that won't be
> used is too much.
> Note that sizeof(struct hrtimer)=64, so unions make everything
> fit nicely.

Maybe we should do
union {
  struct hrtimer timer;
  struct {
    struct work_struct work;
    struct work_struct sync_work;
  }
}

(not nice to read but at least we don't change the size at the beginning)

>
> > So should we reject during run time bpf_timer_set_callback() for
> > sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> > those? (and the invert in the other case).
>
> yes.
>
> > This version of the patch allows for one timer to be used as softIRQ
> > or WQ, depending on the timer_set_callback that is used. But it might
> > be simpler for the kfree_rcu race to define the bpf_timer to be of one
> > kind, so we are sure to call the correct kfree method.
>
> I think one or another simplifies the code and makes it easier
> to think through combinations.
>
> I'm still contemplating adding new "struct bpf_wq" and new kfuncs
> to completely separate wq vs timer.
> The code reuse seems to be relatively small.

There is some code reuse in the verifier, but it can be factored out I think.

Though the biggest reuse might be in the map portion of bpf_timer,
which I haven't looked at much TBH.

> We can potentially factor out internals of bpf_timer_* into smaller
> helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.

Yeah, also, given that we are going to enforce delay == 0 for
sleepable timers (wq), the user api would be much cleaner if we can
have a dedicated bpf_wq (and it would make the flags of bpf_timer_init
easier to deal with).

>
> One more thing.
> bpf_timer_cancel() api turned out to be troublesome.
> Not only it cancels the timer, but drops callback too.
> It was a surprising behavior for people familiar with
> kernel timer api-s.
> We should not repeat this mistake with wq.
>
> We can try to fix bpf_timer_cancel() too.
> If we drop drop_prog_refcnt() from it it shouldn't affect
> existing bpf_timer users who are forced to do:
> bpf_timer_cancel()
> bpf_timer_set_callback()
> bpf_timer_start()
> all the time.
> If/when bpf_timer_cancel() stops dropping the callback
> such bpf prog won't be affected. So low chance of breaking any prog.
> wdyt?
>

How would a program know set_callback() is not required after a
cancel() because the kernel kept it around? It seems that it's going
to be hard for them to know that (unless by trying first a start()),
and it will add more code.

timer_cancel() would be hard to change but we can always do the change
and add a new kfunc timer_cancel_no_drop() which would clearly allow
for new programs to know that set_callback() is not required to be
called. In a few kernel releases we could remove it and say that
timer_cancel() is the same (and replaced by a #define)

Anyway, the more I think of it, the more I think the best API would be
a dedicated wq API. It's probably going to need a little bit more
work, but it'll be more or less this work plus the new bpf_wq type in
the map.

Cheers,
Benjamin
Alexei Starovoitov April 4, 2024, 6:29 p.m. UTC | #6
On Thu, Apr 4, 2024 at 10:56 AM Benjamin Tissoires
<benjamin.tissoires@redhat.com> wrote:
>
> On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
> > <benjamin.tissoires@redhat.com> wrote:
> > >
> > >
> > > >
> > > > So we need something like:
> > > >
> > > > struct bpf_hrtimer {
> > > >   union {
> > > >     struct hrtimer timer;
> > > > +   struct work_struct work;
> > > >   };
> > > >   struct bpf_map *map;
> > > >   struct bpf_prog *prog;
> > > >   void __rcu *callback_fn;
> > > >   void *value;
> > > >   union {
> > >
> > > Are you sure we need an union here? If we get to call kfree_rcu() we
> > > need to have both struct rcu_head and sync_work, not one or the other.
> >
> > why? with an extra flag it's one or the other.
> > In bpf_timer_cancel_and_free()
> > if (flag & SLEEPABLE) {
> >     schedule_work() to cancel_work_sync + kfree_rcu
> > } else {
> >    hrtimer_cancel
> >    kfree_rcu
> > }
>
> I thought kfree_rcu required struct rcu_head, and given that we need
> to initialize sync_work it will be poisoned...

yes. It needs rcu_head.
But where do you see a conflict?
INIT_WORK + schedule_work() will use that space,
then cancel_work_sync() will wait on a different work_struct,
then kfree_rcu() will reuse that space.

In case of hrtimers none of the work_structs will be used.
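
The hand-over of that shared space can be modelled in userspace as
below. This is only a sketch: the toy_* names and the 'tail'
bookkeeping field are hypothetical, the stand-in sizes assume a 64-bit
build, and it takes for granted that the workqueue code no longer
touches a work item once its callback has been entered.

```c
#include <assert.h>
#include <stdint.h>

enum toy_tail_use { TOY_TAIL_UNUSED, TOY_TAIL_SYNC_WORK, TOY_TAIL_RCU_HEAD };

struct toy_tail_timer {
	union {
		struct { uint64_t opaque[2]; } rcu;       /* stand-in for struct rcu_head */
		struct { uint64_t opaque[4]; } sync_work; /* stand-in for struct work_struct */
	};
	enum toy_tail_use tail; /* bookkeeping for the model only */
};

/* Teardown side: INIT_WORK + schedule_work() claim the shared space. */
void toy_queue_sync_work(struct toy_tail_timer *t)
{
	assert(t->tail == TOY_TAIL_UNUSED);
	t->sync_work.opaque[0] = 1;
	t->tail = TOY_TAIL_SYNC_WORK;
}

/* sync_work callback: once cancel_work_sync() on the *other* work item
 * has returned, sync_work itself is finished and the same bytes can
 * hold the rcu_head handed to kfree_rcu().
 */
void toy_sync_work_then_kfree_rcu(struct toy_tail_timer *t)
{
	assert(t->tail == TOY_TAIL_SYNC_WORK);
	t->rcu.opaque[0] = 2;
	t->tail = TOY_TAIL_RCU_HEAD;
}
```

The two uses never overlap in time, so the union costs only the size
of its larger member.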

>
> >
> > > >     struct rcu_head rcu;
> > > > +   struct work_struct sync_work;
> > > >   };
> > > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
> > >
> > > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> > > (like in my v2 or v3 IIRC). But that means that once a timer is
> > > initialized it needs to be of one or the other type (especially true
> > > with the first union in this struct).
> >
> > yes. That's an idea.
> > The code to support wq vs timer seems to be diverging more
> > than what we expected initially.
> > > It seems cleaner to set it at init time and enforce it in
> > > other helpers.
>
> OK, works for me.
>
> >
> > Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
> > too far.
> > It's already at 112 bytes and some people use bpf_timer per flow.
> > So potentially millions of such timers.
> > Adding extra sizeof(struct work_struct)=32 * 2 that won't be
> > used is too much.
> > Note that sizeof(struct hrtimer)=64, so unions make everything
> > fit nicely.
>
> Maybe we should do
> union {
>   struct hrtimer timer;
>   struct {
>     struct work_struct work;
>     struct work_struct sync_work;
>   }
> }

It's also ok, but sharing rcu_head and work_struct seems
cleaner, since it highlights that they're exclusive.

> (not nice to read but at least we don't change the size at the beginning)
>
> >
> > > So should we reject during run time bpf_timer_set_callback() for
> > > sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> > > > those? (and the inverse in the other case).
> >
> > yes.
> >
> > > This version of the patch allows for one timer to be used as softIRQ
> > > or WQ, depending on the timer_set_callback that is used. But it might
> > > be simpler for the kfree_rcu race to define the bpf_timer to be of one
> > > kind, so we are sure to call the correct kfree method.
> >
> > I think one or another simplifies the code and makes it easier
> > to think through combinations.
> >
> > I'm still contemplating adding new "struct bpf_wq" and new kfuncs
> > to completely separate wq vs timer.
> > The code reuse seems to be relatively small.
>
> There is some code reuse in the verifier, but it can be factored out I think.
>
> Though the biggest reuse might be in the map portion of bpf_timer,
> which I haven't looked much TBH.

Right. It's all the 'case BPF_TIMER:' in various places.
New 'struct bpf_wq' would need another entry in btf_field_type.
But that should be a straightforward addition.

>
> > We can potentially factor out internals of bpf_timer_* into smaller
> > helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
>
> Yeah, also, given that we are going to enforce delay == 0 for
> sleepable timers (wq), the user api would be much cleaner if we can
> have a dedicated bpf_wq (and it would make the flags of bpf_timer_init
> easier to deal with).

It seems so.
Kinda hard to judge one way or the other without looking at
the final code, but it seems separation is worth attempting, at least.

Also if we ever do hrtimer+wq we probably will be using
'struct delayed_work' instead of rolling our own
'struct hrtimer' + 'struct work_struct' combo.

It seems wq logic already made such a combination special enough
and thought through the races, so we better just follow that path.
In that case it might be yet another 'struct bpf_delayed_wq'
and another set of kfuncs.
Considering that cancel_work() and cancel_delayed_work()
are separate APIs in the kernel,
multiplexing all of them under bpf_timer_cancel()
seems wrong.
In the past we were somewhat limited in terms of helpers.
We tried not to add them unless absolutely necessary because
of uapi considerations.
Now with kfuncs we can add/tweak/remove them at will.

>
> >
> > One more thing.
> > bpf_timer_cancel() api turned out to be troublesome.
> > > Not only does it cancel the timer, it drops the callback too.
> > It was a surprising behavior for people familiar with
> > kernel timer api-s.
> > We should not repeat this mistake with wq.
> >
> > We can try to fix bpf_timer_cancel() too.
> > If we drop drop_prog_refcnt() from it it shouldn't affect
> > existing bpf_timer users who are forced to do:
> > bpf_timer_cancel()
> > bpf_timer_set_callback()
> > bpf_timer_start()
> > all the time.
> > If/when bpf_timer_cancel() stops dropping the callback
> > such bpf prog won't be affected. So low chance of breaking any prog.
> > wdyt?
> >
>
> How would a program know set_callback() is not required after a
> cancel() because the kernel kept it around? It seems that it's going
> to be hard for them to know that (unless by trying first a start()),
> and it will add more code.
>
> timer_cancel() would be hard to change but we can always do the change
> and add a new kfunc timer_cancel_no_drop() which would clearly allow

that works too.

> for new programs to know that set_callback() is not required to be
> called. In a few kernel releases we could remove it and say that
> timer_cancel() is the same (and replaced by a #define)

#define won't work, since mechanics of detecting and calling
helpers vs kfuncs is quite different.
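
To make the cancel semantics under discussion concrete, here is a toy
userspace model. Everything named toy_* is hypothetical and only mimics
the behaviour described in this thread: toy_cancel() drops the callback
the way bpf_timer_cancel() does today, and toy_cancel_no_drop() is the
variant proposed above. Returning -EINVAL from toy_start() when no
callback is set is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*toy_cb_t)(void *ctx);

struct toy_timer {
	toy_cb_t cb;	/* NULL means "no callback set" */
	int armed;	/* 1 while the timer is pending */
};

static void toy_set_callback(struct toy_timer *t, toy_cb_t cb)
{
	assert(t != NULL);
	t->cb = cb;
}

/* Arming fails if no callback is attached. */
static int toy_start(struct toy_timer *t)
{
	assert(t != NULL);
	if (!t->cb)
		return -EINVAL;
	t->armed = 1;
	return 0;
}

/* Behaviour described in the thread: cancel also drops the callback.
 * Returns 1 if the timer was pending, 0 otherwise.
 */
static int toy_cancel(struct toy_timer *t)
{
	int was_armed = t->armed;

	t->armed = 0;
	t->cb = NULL;
	return was_armed;
}

/* Proposed variant: cancel only, the callback stays attached. */
static int toy_cancel_no_drop(struct toy_timer *t)
{
	int was_armed = t->armed;

	t->armed = 0;
	return was_armed;
}

/* Trivial callback used to exercise the model. */
static int toy_noop_cb(void *ctx)
{
	(void)ctx;
	return 0;
}
```

In this model a start after toy_cancel() fails until the callback is set
again, which is the cancel/set_callback/start sequence users are forced
into; after toy_cancel_no_drop() the timer can be restarted directly.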

> Anyway, the more I think of it, the more I think the best API would be
> a dedicated wq API. It's probably going to need a little bit more
> work, but it'll be more or less this work plus the new bpf_wq type in
> the map.

It seems to me as well.

Thanks for brainstorming.
Benjamin Tissoires April 5, 2024, 3:46 p.m. UTC | #7
On Thu, Apr 4, 2024 at 8:29 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 10:56 AM Benjamin Tissoires
> <benjamin.tissoires@redhat.com> wrote:
> >
> > On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires
> > > <benjamin.tissoires@redhat.com> wrote:
> > > >
> > > >
> > > > >
> > > > > So we need something like:
> > > > >
> > > > > struct bpf_hrtimer {
> > > > >   union {
> > > > >     struct hrtimer timer;
> > > > > +   struct work_struct work;
> > > > >   };
> > > > >   struct bpf_map *map;
> > > > >   struct bpf_prog *prog;
> > > > >   void __rcu *callback_fn;
> > > > >   void *value;
> > > > >   union {
> > > >
> > > > > Are you sure we need a union here? If we get to call kfree_rcu() we
> > > > need to have both struct rcu_head and sync_work, not one or the other.
> > >
> > > why? with an extra flag it's one or the other.
> > > In bpf_timer_cancel_and_free()
> > > if (flag & SLEEPABLE) {
> > >     schedule_work() to cancel_work_sync + kfree_rcu
> > > } else {
> > >    hrtimer_cancel
> > >    kfree_rcu
> > > }
> >
> > I thought kfree_rcu required struct rcu_head, and given that we need
> > to initialize sync_work it will be poisoned...
>
> yes. It needs rcu_head.
> But where do you see a conflict?
> INIT_WORK + schedule_work() will use that space,
> then cancel_work_sync() will wait on a different work_struct,
> then kfree_rcu() will reuse that space.

Yeah, sorry, I hadn't realized that the memory used by kfree_rcu
wasn't initialized.

>
> In case of hrtimers none of the work_structs will be used.
>
> >
> > >
> > > > >     struct rcu_head rcu;
> > > > > +   struct work_struct sync_work;
> > > > >   };
> > > > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
> > > >
> > > > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
> > > > (like in my v2 or v3 IIRC). But that means that once a timer is
> > > > initialized it needs to be of one or the other type (especially true
> > > > with the first union in this struct).
> > >
> > > yes. That's an idea.
> > > The code to support wq vs timer seems to be diverging more
> > > than what we expected initially.
> > > > It seems cleaner to set it at init time and enforce it in
> > > > other helpers.
> >
> > OK, works for me.
> >
> > >
> > > Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer)
> > > too far.
> > > It's already at 112 bytes and some people use bpf_timer per flow.
> > > So potentially millions of such timers.
> > > Adding extra sizeof(struct work_struct)=32 * 2 that won't be
> > > used is too much.
> > > Note that sizeof(struct hrtimer)=64, so unions make everything
> > > fit nicely.
> >
> > Maybe we should do
> > union {
> >   struct hrtimer timer;
> >   struct {
> >     struct work_struct work;
> >     struct work_struct sync_work;
> >   }
> > }
>
> It's also ok, but sharing rcu_head and work_struct seems
> cleaner, since it highlights that they're exclusive.
>
> > (not nice to read but at least we don't change the size at the beginning)
> >
> > >
> > > > So should we reject during run time bpf_timer_set_callback() for
> > > > sleepable timers and only allow bpf_timer_set_sleepable_cb() for
> > > > > those? (and the inverse in the other case).
> > >
> > > yes.
> > >
> > > > This version of the patch allows for one timer to be used as softIRQ
> > > > or WQ, depending on the timer_set_callback that is used. But it might
> > > > be simpler for the kfree_rcu race to define the bpf_timer to be of one
> > > > kind, so we are sure to call the correct kfree method.
> > >
> > > I think one or another simplifies the code and makes it easier
> > > to think through combinations.
> > >
> > > I'm still contemplating adding new "struct bpf_wq" and new kfuncs
> > > to completely separate wq vs timer.
> > > The code reuse seems to be relatively small.
> >
> > There is some code reuse in the verifier, but it can be factored out I think.
> >
> > Though the biggest reuse might be in the map portion of bpf_timer,
> > which I haven't looked much TBH.
>
> Right. It's all the 'case BPF_TIMER:' in various places.
> New 'struct bpf_wq' would need another entry in btf_field_type.
> But that should be a straightforward addition.
>
> >
> > > We can potentially factor out internals of bpf_timer_* into smaller
> > > helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs.
> >
> > Yeah, also, given that we are going to enforce delay == 0 for
> > sleepable timers (wq), the user api would be much cleaner if we can
> > have a dedicated bpf_wq (and it would make the flags of bpf_timer_init
> > easier to deal with).
>
> It seems so.
> Kinda hard to judge one way or the other without looking at
> the final code, but it seems separation is worth attempting, at least.
>
> Also if we ever do hrtimer+wq we probably will be using
> 'struct delayed_work' instead of rolling our own
> 'struct hrtimer' + 'struct work_struct' combo.
>
> It seems wq logic already made such a combination special enough
> and thought through the races, so we better just follow that path.
> In that case it might be yet another 'struct bpf_delayed_wq'
> and another set of kfuncs.
> Considering that cancel_work() and cancel_delayed_work()
> are separate APIs in the kernel,
> multiplexing all of them under bpf_timer_cancel()
> seems wrong.
> In the past we were somewhat limited in terms of helpers.
> We tried not to add them unless absolutely necessary because
> of uapi considerations.
> Now with kfuncs we can add/tweak/remove them at will.
>
> >
> > >
> > > One more thing.
> > > bpf_timer_cancel() api turned out to be troublesome.
> > > > Not only does it cancel the timer, it drops the callback too.
> > > It was a surprising behavior for people familiar with
> > > kernel timer api-s.
> > > We should not repeat this mistake with wq.
> > >
> > > We can try to fix bpf_timer_cancel() too.
> > > If we drop drop_prog_refcnt() from it it shouldn't affect
> > > existing bpf_timer users who are forced to do:
> > > bpf_timer_cancel()
> > > bpf_timer_set_callback()
> > > bpf_timer_start()
> > > all the time.
> > > If/when bpf_timer_cancel() stops dropping the callback
> > > such bpf prog won't be affected. So low chance of breaking any prog.
> > > wdyt?
> > >
> >
> > How would a program know set_callback() is not required after a
> > cancel() because the kernel kept it around? It seems that it's going
> > to be hard for them to know that (unless by trying first a start()),
> > and it will add more code.
> >
> > timer_cancel() would be hard to change but we can always do the change
> > and add a new kfunc timer_cancel_no_drop() which would clearly allow
>
> that works too.
>
> > for new programs to know that set_callback() is not required to be
> > called. In a few kernel releases we could remove it and say that
> > timer_cancel() is the same (and replaced by a #define)
>
> #define won't work, since mechanics of detecting and calling
> helpers vs kfuncs is quite different.
>
> > Anyway, the more I think of it, the more I think the best API would be
> > a dedicated wq API. It's probably going to need a little bit more
> > work, but it'll be more or less this work plus the new bpf_wq type in
> > the map.
>
> It seems to me as well.
>
> Thanks for brainstorming.
>

Alright, as of today (and I'm about to be AFK for the weekend), I got
your changes in and working (I think). I'll review the series on
Monday and send it back so we have a baseline to compare bpf_wq
against.

Cheers,
Benjamin
Alexei Starovoitov April 5, 2024, 4:07 p.m. UTC | #8
On Fri, Apr 5, 2024 at 8:46 AM Benjamin Tissoires
<benjamin.tissoires@redhat.com> wrote:
>
> Alright, as of today (and I'm about to be AFK for the weekend), I got
> your changes in and working (I think). I'll review the series on
> Monday and send it back so we have a baseline to compare bpf_wq
> against.

Nice! Looking forward to it.

Patch

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3c42b9f1bada..b90def29d796 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7461,10 +7461,14 @@  struct bpf_core_relo {
  *     - BPF_F_TIMER_ABS: Timeout passed is absolute time, by default it is
  *       relative to current time.
  *     - BPF_F_TIMER_CPU_PIN: Timer will be pinned to the CPU of the caller.
+ *     - BPF_F_TIMER_SLEEPABLE: Timer will run in a sleepable context, with
+ *       no guarantees of ordering nor timing (consider this as being just
+ *       offloaded immediately).
  */
 enum {
 	BPF_F_TIMER_ABS = (1ULL << 0),
 	BPF_F_TIMER_CPU_PIN = (1ULL << 1),
+	BPF_F_TIMER_SLEEPABLE = (1ULL << 2),
 };
 
 /* BPF numbers iterator state */
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index a89587859571..38de73a9df83 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1094,14 +1094,20 @@  const struct bpf_func_proto bpf_snprintf_proto = {
  * bpf_timer_cancel() cancels the timer and decrements prog's refcnt.
  * Inner maps can contain bpf timers as well. ops->map_release_uref is
  * freeing the timers when inner map is replaced or deleted by user space.
+ *
+ * sleepable_lock protects only the setup of the workqueue, not the callback
+ * itself. This is done to ensure we don't run concurrently a free of the
+ * callback or the associated program.
  */
 struct bpf_hrtimer {
 	struct hrtimer timer;
+	struct work_struct work;
 	struct bpf_map *map;
 	struct bpf_prog *prog;
 	void __rcu *callback_fn;
 	void *value;
 	struct rcu_head rcu;
+	spinlock_t sleepable_lock;
 };
 
 /* the actual struct hidden inside uapi struct bpf_timer */
@@ -1114,6 +1120,49 @@  struct bpf_timer_kern {
 	struct bpf_spin_lock lock;
 } __attribute__((aligned(8)));
 
+static void bpf_timer_work_cb(struct work_struct *work)
+{
+	struct bpf_hrtimer *t = container_of(work, struct bpf_hrtimer, work);
+	struct bpf_map *map = t->map;
+	bpf_callback_t callback_fn;
+	void *value = t->value;
+	unsigned long flags;
+	void *key;
+	u32 idx;
+
+	BTF_TYPE_EMIT(struct bpf_timer);
+
+	spin_lock_irqsave(&t->sleepable_lock, flags);
+
+	callback_fn = READ_ONCE(t->callback_fn);
+	if (!callback_fn) {
+		spin_unlock_irqrestore(&t->sleepable_lock, flags);
+		return;
+	}
+
+	if (map->map_type == BPF_MAP_TYPE_ARRAY) {
+		struct bpf_array *array = container_of(map, struct bpf_array, map);
+
+		/* compute the key */
+		idx = ((char *)value - array->value) / array->elem_size;
+		key = &idx;
+	} else { /* hash or lru */
+		key = value - round_up(map->key_size, 8);
+	}
+
+	/* prevent the callback to be freed by bpf_timer_cancel() while running
+	 * so we can release the sleepable lock
+	 */
+	bpf_prog_inc(t->prog);
+
+	spin_unlock_irqrestore(&t->sleepable_lock, flags);
+
+	callback_fn((u64)(long)map, (u64)(long)key, (u64)(long)value, 0, 0);
+	/* The verifier checked that return value is zero. */
+
+	bpf_prog_put(t->prog);
+}
+
 static DEFINE_PER_CPU(struct bpf_hrtimer *, hrtimer_running);
 
 static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer)
@@ -1192,6 +1241,8 @@  BPF_CALL_3(bpf_timer_init, struct bpf_timer_kern *, timer, struct bpf_map *, map
 	t->prog = NULL;
 	rcu_assign_pointer(t->callback_fn, NULL);
 	hrtimer_init(&t->timer, clockid, HRTIMER_MODE_REL_SOFT);
+	INIT_WORK(&t->work, bpf_timer_work_cb);
+	spin_lock_init(&t->sleepable_lock);
 	t->timer.function = bpf_timer_cb;
 	WRITE_ONCE(timer->timer, t);
 	/* Guarantee the order between timer->timer and map->usercnt. So
@@ -1237,6 +1288,7 @@  BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
 		ret = -EINVAL;
 		goto out;
 	}
+	spin_lock(&t->sleepable_lock);
 	if (!atomic64_read(&t->map->usercnt)) {
 		/* maps with timers must be either held by user space
 		 * or pinned in bpffs. Otherwise timer might still be
@@ -1263,6 +1315,8 @@  BPF_CALL_3(bpf_timer_set_callback, struct bpf_timer_kern *, timer, void *, callb
 	}
 	rcu_assign_pointer(t->callback_fn, callback_fn);
 out:
+	if (t)
+		spin_unlock(&t->sleepable_lock);
 	__bpf_spin_unlock_irqrestore(&timer->lock);
 	return ret;
 }
@@ -1283,8 +1337,12 @@  BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
 
 	if (in_nmi())
 		return -EOPNOTSUPP;
-	if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN))
+	if (flags & ~(BPF_F_TIMER_ABS | BPF_F_TIMER_CPU_PIN | BPF_F_TIMER_SLEEPABLE))
 		return -EINVAL;
+
+	if ((flags & BPF_F_TIMER_SLEEPABLE) && nsecs)
+		return -EINVAL;
+
 	__bpf_spin_lock_irqsave(&timer->lock);
 	t = timer->timer;
 	if (!t || !t->prog) {
@@ -1300,7 +1358,10 @@  BPF_CALL_3(bpf_timer_start, struct bpf_timer_kern *, timer, u64, nsecs, u64, fla
 	if (flags & BPF_F_TIMER_CPU_PIN)
 		mode |= HRTIMER_MODE_PINNED;
 
-	hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
+	if (flags & BPF_F_TIMER_SLEEPABLE)
+		schedule_work(&t->work);
+	else
+		hrtimer_start(&t->timer, ns_to_ktime(nsecs), mode);
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
 	return ret;
@@ -1348,13 +1409,22 @@  BPF_CALL_1(bpf_timer_cancel, struct bpf_timer_kern *, timer)
 		ret = -EDEADLK;
 		goto out;
 	}
+	spin_lock(&t->sleepable_lock);
 	drop_prog_refcnt(t);
+	spin_unlock(&t->sleepable_lock);
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
 	/* Cancel the timer and wait for associated callback to finish
 	 * if it was running.
 	 */
 	ret = ret ?: hrtimer_cancel(&t->timer);
+
+	/* also cancel the sleepable work, but *do not* wait for
+	 * it to finish if it was running as we might not be in a
+	 * sleepable context
+	 */
+	ret = ret ?: cancel_work(&t->work);
+
 	rcu_read_unlock();
 	return ret;
 }
@@ -1383,11 +1453,13 @@  void bpf_timer_cancel_and_free(void *val)
 	t = timer->timer;
 	if (!t)
 		goto out;
+	spin_lock(&t->sleepable_lock);
 	drop_prog_refcnt(t);
 	/* The subsequent bpf_timer_start/cancel() helpers won't be able to use
 	 * this timer, since it won't be initialized.
 	 */
 	WRITE_ONCE(timer->timer, NULL);
+	spin_unlock(&t->sleepable_lock);
 out:
 	__bpf_spin_unlock_irqrestore(&timer->lock);
 	if (!t)
@@ -1410,6 +1482,16 @@  void bpf_timer_cancel_and_free(void *val)
 	 */
 	if (this_cpu_read(hrtimer_running) != t)
 		hrtimer_cancel(&t->timer);
+
+	/* also cancel the sleepable work, but *do not* wait for
+	 * it to finish if it was running as we might not be in a
+	 * sleepable context. Same reason as above, it's fine to
+	 * free 't': the subprog callback will never access it anymore
+	 * and can not reschedule itself since timer->timer = NULL was
+	 * already done.
+	 */
+	cancel_work(&t->work);
+
 	kfree_rcu(t, rcu);
 }
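
For readers following the key computation in bpf_timer_work_cb() above,
here is a minimal userspace sketch of the same pointer arithmetic. The
helper names (array_index_of, round_up8, hash_key_of) are hypothetical;
the sketch assumes, as the patch does, that array values are laid out
back to back and that a hash/LRU element stores its key immediately
before the value, padded to a multiple of 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Array maps: the key is the element index, recovered from the offset of
 * the value inside the contiguous value area.
 */
static uint32_t array_index_of(const char *base, uint32_t elem_size,
			       const void *value)
{
	assert(elem_size != 0);
	return (uint32_t)(((const char *)value - base) / elem_size);
}

/* Round n up to the next multiple of 8, like round_up(n, 8). */
static size_t round_up8(size_t n)
{
	return (n + 7) & ~(size_t)7;
}

/* Hash/LRU maps: step back from the value over the padded key. */
static const void *hash_key_of(const void *value, size_t key_size)
{
	return (const char *)value - round_up8(key_size);
}
```

For example, with 16-byte elements a value pointer 32 bytes past the base
maps to index 2, and a 5-byte key sits 8 bytes before its value.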