[v4,04/40] drm/sched: Add enqueue credit limit

Message ID 20250514170118.40555-5-robdclark@gmail.com
State New
Series drm/msm: sparse / "VM_BIND" support

Commit Message

Rob Clark May 14, 2025, 4:59 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

Similar to the existing credit limit mechanism, but applying to jobs
enqueued to the scheduler but not yet run.

The use case is to put an upper bound on preallocated, and potentially
unneeded, pgtable pages.  When this limit is exceeded, pushing new jobs
will block until the count drops below the limit.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++++--
 drivers/gpu/drm/scheduler/sched_main.c   |  3 +++
 include/drm/gpu_scheduler.h              | 13 ++++++++++++-
 3 files changed, 29 insertions(+), 3 deletions(-)
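
For illustration, a minimal sketch of how a driver might wire this up (the values and helper names below are illustrative; as described in the thread below, msm counts pre-allocated pgtable pages and uses a limit of 1024):

    struct drm_sched_init_args args = {
            .ops = &my_sched_ops,           /* hypothetical driver ops */
            .credit_limit = 64,
            .enqueue_credit_limit = 1024,   /* max pre-alloc'd pages in flight */
            /* ... */
    };

    /* at submit time, before pushing the job: */
    job->enqueue_credits = nr_prealloc_pgtable_pages;   /* hypothetical count */

    /* may now block until enqueue_credit_count drops back under the limit */
    ret = drm_sched_entity_push_job(job);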

Comments

Philipp Stanner May 15, 2025, 9:28 a.m. UTC | #1
Hello,

On Wed, 2025-05-14 at 09:59 -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Similar to the existing credit limit mechanism, but applying to jobs
> enqueued to the scheduler but not yet run.
> 
> The use case is to put an upper bound on preallocated, and
> potentially
> unneeded, pgtable pages.  When this limit is exceeded, pushing new
> jobs
> will block until the count drops below the limit.

the commit message doesn't make clear why that's needed within the
scheduler.
Connor Abbott May 15, 2025, 4:22 p.m. UTC | #2
On Thu, May 15, 2025 at 12:15 PM Rob Clark <robdclark@chromium.org> wrote:
>
> On Thu, May 15, 2025 at 2:28 AM Philipp Stanner <phasta@mailbox.org> wrote:
> >
> > Hello,
> >
> > On Wed, 2025-05-14 at 09:59 -0700, Rob Clark wrote:
> > > From: Rob Clark <robdclark@chromium.org>
> > >
> > > Similar to the existing credit limit mechanism, but applying to jobs
> > > enqueued to the scheduler but not yet run.
> > >
> > > The use case is to put an upper bound on preallocated, and
> > > potentially
> > > unneeded, pgtable pages.  When this limit is exceeded, pushing new
> > > jobs
> > > will block until the count drops below the limit.
> >
> > the commit message doesn't make clear why that's needed within the
> > scheduler.
> >
> > From what I understand from the cover letter, this is a (rare?) Vulkan
> > feature. And as important as Vulkan is, it's the drivers that implement
> > support for it. I don't see why the scheduler is a blocker.
>
> Maybe not rare, or at least it comes up with a group of deqp-vk tests ;-)
>
> Basically it is a way to throttle userspace to prevent it from OoM'ing
> itself.  (I suppose userspace could throttle itself, but it doesn't
> really know how much pre-allocation will need to be done for pgtable
> updates.)

For some context, other drivers have the concept of a "synchronous"
VM_BIND ioctl which completes immediately, and drivers implement it by
waiting for the whole thing to finish before returning. But this
doesn't work for native context, where everything has to be
asynchronous, so we're trying a new approach where we instead submit
an asynchronous bind for "normal" (non-sparse/driver internal)
allocations and only attach its out-fence to the in-fence of
subsequent submits to other queues. Once you do this then you need a
limit like this to prevent memory usage from pending page table
updates from getting out of control. Other drivers haven't needed this
yet, but they will when they get native context support.

Connor
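
As a rough sketch of that scheme (identifiers are illustrative; only drm_sched_job_add_dependency() and the scheduler fence come from the existing API, and both jobs are assumed to have already gone through drm_sched_job_init()/drm_sched_job_arm()):

    struct dma_fence *bind_done;

    /* async VM_BIND: push the bind job and keep its finished fence */
    bind_done = dma_fence_get(&bind_job->s_fence->finished);
    drm_sched_entity_push_job(bind_job);

    /* a later submit to another queue only runs once the bind completed;
     * drm_sched_job_add_dependency() takes over the fence reference */
    ret = drm_sched_job_add_dependency(submit_job, bind_done);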

>
> > All the knowledge about when to stop pushing into the entity is in the
> > driver, and the scheduler obtains all the knowledge about that from the
> > driver anyways.
> >
> > So you could do
> >
> > if (my_vulkan_condition())
> >    drm_sched_entity_push_job();
> >
> > couldn't you?
>
> It would need to reach in and use the sched's job_scheduled
> wait_queue_head_t...  if that isn't too ugly, maybe the rest could be
> implemented on top of sched.  But it seemed like a reasonable thing
> for the scheduler to support directly.
>
> > >
> > > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_entity.c | 16 ++++++++++++++--
> > >  drivers/gpu/drm/scheduler/sched_main.c   |  3 +++
> > >  include/drm/gpu_scheduler.h              | 13 ++++++++++++-
> > >  3 files changed, 29 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
> > > b/drivers/gpu/drm/scheduler/sched_entity.c
> > > index dc0e60d2c14b..c5f688362a34 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > @@ -580,11 +580,21 @@ void drm_sched_entity_select_rq(struct
> > > drm_sched_entity *entity)
> > >   * under common lock for the struct drm_sched_entity that was set up
> > > for
> > >   * @sched_job in drm_sched_job_init().
> > >   */
> > > -void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > +int drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> >
> > Return code would need to be documented in the docstring, too. If we'd
> > go for that solution.
> >
> > >  {
> > >       struct drm_sched_entity *entity = sched_job->entity;
> > > +     struct drm_gpu_scheduler *sched = sched_job->sched;
> > >       bool first;
> > >       ktime_t submit_ts;
> > > +     int ret;
> > > +
> > > +     ret = wait_event_interruptible(
> > > +                     sched->job_scheduled,
> > > +                     atomic_read(&sched->enqueue_credit_count) <=
> > > +                     sched->enqueue_credit_limit);
> >
> > This very significantly changes the function's semantics. This function
> > is used in a great many drivers, and here it would be transformed into
> > a function that can block.
> >
> > From what I see below those credits are to be optional. But even if, it
> > needs to be clearly documented when a function can block.
>
> Sure.  The behavior changes only for drivers that use the
> enqueue_credit_limit, so other drivers should be unaffected.
>
> I can improve the docs.
>
> (Maybe push_credit or something else would be a better name than
> enqueue_credit?)
>
> >
> > > +     if (ret)
> > > +             return ret;
> > > +     atomic_add(sched_job->enqueue_credits, &sched->enqueue_credit_count);
> > >
> > >       trace_drm_sched_job(sched_job, entity);
> > >       atomic_inc(entity->rq->sched->score);
> > > @@ -609,7 +619,7 @@ void drm_sched_entity_push_job(struct
> > > drm_sched_job *sched_job)
> > >                       spin_unlock(&entity->lock);
> > >
> > >                       DRM_ERROR("Trying to push to a killed
> > > entity\n");
> > > -                     return;
> > > +                     return -EINVAL;
> > >               }
> > >
> > >               rq = entity->rq;
> > > @@ -626,5 +636,7 @@ void drm_sched_entity_push_job(struct
> > > drm_sched_job *sched_job)
> > >
> > >               drm_sched_wakeup(sched);
> > >       }
> > > +
> > > +     return 0;
> > >  }
> > >  EXPORT_SYMBOL(drm_sched_entity_push_job);
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 9412bffa8c74..1102cca69cb4 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1217,6 +1217,7 @@ static void drm_sched_run_job_work(struct
> > > work_struct *w)
> > >
> > >       trace_drm_run_job(sched_job, entity);
> > >       fence = sched->ops->run_job(sched_job);
> > > +     atomic_sub(sched_job->enqueue_credits, &sched->enqueue_credit_count);
> > >       complete_all(&entity->entity_idle);
> > >       drm_sched_fence_scheduled(s_fence, fence);
> > >
> > > @@ -1253,6 +1254,7 @@ int drm_sched_init(struct drm_gpu_scheduler
> > > *sched, const struct drm_sched_init_
> > >
> > >       sched->ops = args->ops;
> > >       sched->credit_limit = args->credit_limit;
> > > +     sched->enqueue_credit_limit = args->enqueue_credit_limit;
> > >       sched->name = args->name;
> > >       sched->timeout = args->timeout;
> > >       sched->hang_limit = args->hang_limit;
> > > @@ -1308,6 +1310,7 @@ int drm_sched_init(struct drm_gpu_scheduler
> > > *sched, const struct drm_sched_init_
> > >       INIT_LIST_HEAD(&sched->pending_list);
> > >       spin_lock_init(&sched->job_list_lock);
> > >       atomic_set(&sched->credit_count, 0);
> > > +     atomic_set(&sched->enqueue_credit_count, 0);
> > >       INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > >       INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > >       INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
> > > diff --git a/include/drm/gpu_scheduler.h
> > > b/include/drm/gpu_scheduler.h
> > > index da64232c989d..d830ffe083f1 100644
> > > --- a/include/drm/gpu_scheduler.h
> > > +++ b/include/drm/gpu_scheduler.h
> > > @@ -329,6 +329,7 @@ struct drm_sched_fence *to_drm_sched_fence(struct
> > > dma_fence *f);
> > >   * @s_fence: contains the fences for the scheduling of job.
> > >   * @finish_cb: the callback for the finished fence.
> > >   * @credits: the number of credits this job contributes to the
> > > scheduler
> > > + * @enqueue_credits: the number of enqueue credits this job
> > > contributes
> > >   * @work: Helper to reschedule job kill to different context.
> > >   * @id: a unique id assigned to each job scheduled on the scheduler.
> > >   * @karma: increment on every hang caused by this job. If this
> > > exceeds the hang
> > > @@ -366,6 +367,7 @@ struct drm_sched_job {
> > >
> > >       enum drm_sched_priority         s_priority;
> > >       u32                             credits;
> > > +     u32                             enqueue_credits;
> >
> > What's the policy of setting this?
> >
> > drm_sched_job_init() and drm_sched_job_arm() are responsible for
> > initializing jobs.
>
> It should be set before drm_sched_entity_push_job().  I wouldn't
> really expect drivers to know the value at drm_sched_job_init() time.
> But they would by the time drm_sched_entity_push_job() is called.
>
> > >       /** @last_dependency: tracks @dependencies as they signal */
> > >       unsigned int                    last_dependency;
> > >       atomic_t                        karma;
> > > @@ -485,6 +487,10 @@ struct drm_sched_backend_ops {
> > >   * @ops: backend operations provided by the driver.
> > >   * @credit_limit: the credit limit of this scheduler
> > >   * @credit_count: the current credit count of this scheduler
> > > + * @enqueue_credit_limit: the credit limit of jobs pushed to
> > > scheduler and not
> > > + *                        yet run
> > > + * @enqueue_credit_count: the current credit count of jobs pushed to
> > > scheduler
> > > + *                        but not yet run
> > >   * @timeout: the time after which a job is removed from the
> > > scheduler.
> > >   * @name: name of the ring for which this scheduler is being used.
> > >   * @num_rqs: Number of run-queues. This is at most
> > > DRM_SCHED_PRIORITY_COUNT,
> > > @@ -518,6 +524,8 @@ struct drm_gpu_scheduler {
> > >       const struct drm_sched_backend_ops      *ops;
> > >       u32                             credit_limit;
> > >       atomic_t                        credit_count;
> > > +     u32                             enqueue_credit_limit;
> > > +     atomic_t                        enqueue_credit_count;
> > >       long                            timeout;
> > >       const char                      *name;
> > >       u32                             num_rqs;
> > > @@ -550,6 +558,8 @@ struct drm_gpu_scheduler {
> > >   * @num_rqs: Number of run-queues. This may be at most
> > > DRM_SCHED_PRIORITY_COUNT,
> > >   *        as there's usually one run-queue per priority, but may
> > > be less.
> > >   * @credit_limit: the number of credits this scheduler can hold from
> > > all jobs
> > > + * @enqueue_credit_limit: the number of credits that can be enqueued
> > > before
> > > + *                        drm_sched_entity_push_job() blocks
> >
> > Is it optional or not? Can it be deactivated?
> >
> > It seems to me that it is optional, and so far only used in msm. If
> > there are no other parties in need for that mechanism, the right place
> > to have this feature probably is msm, which has all the knowledge about
> > when to block already.
> >
>
> As with the existing credit_limit, it is optional.  Although I think
> it would be also useful for other drivers that use drm sched for
> VM_BIND queues, for the same reason.
>
> BR,
> -R
>
> >
> > Regards
> > P.
> >
> >
> > >   * @hang_limit: number of times to allow a job to hang before
> > > dropping it.
> > >   *           This mechanism is DEPRECATED. Set it to 0.
> > >   * @timeout: timeout value in jiffies for submitted jobs.
> > > @@ -564,6 +574,7 @@ struct drm_sched_init_args {
> > >       struct workqueue_struct *timeout_wq;
> > >       u32 num_rqs;
> > >       u32 credit_limit;
> > > +     u32 enqueue_credit_limit;
> > >       unsigned int hang_limit;
> > >       long timeout;
> > >       atomic_t *score;
> > > @@ -600,7 +611,7 @@ int drm_sched_job_init(struct drm_sched_job *job,
> > >                      struct drm_sched_entity *entity,
> > >                      u32 credits, void *owner);
> > >  void drm_sched_job_arm(struct drm_sched_job *job);
> > > -void drm_sched_entity_push_job(struct drm_sched_job *sched_job);
> > > +int drm_sched_entity_push_job(struct drm_sched_job *sched_job);
> > >  int drm_sched_job_add_dependency(struct drm_sched_job *job,
> > >                                struct dma_fence *fence);
> > >  int drm_sched_job_add_syncobj_dependency(struct drm_sched_job *job,
> >
Danilo Krummrich May 15, 2025, 5:29 p.m. UTC | #3
(Cc: Boris)

On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> For some context, other drivers have the concept of a "synchronous"
> VM_BIND ioctl which completes immediately, and drivers implement it by
> waiting for the whole thing to finish before returning.

Nouveau implements sync by issuing a normal async VM_BIND and subsequently
waits for the out-fence synchronously.

> But this
> doesn't work for native context, where everything has to be
> asynchronous, so we're trying a new approach where we instead submit
> an asynchronous bind for "normal" (non-sparse/driver internal)
> allocations and only attach its out-fence to the in-fence of
> subsequent submits to other queues.

This is what nouveau does and I think other drivers like Xe and panthor do this
as well.

> Once you do this then you need a
> limit like this to prevent memory usage from pending page table
> updates from getting out of control. Other drivers haven't needed this
> yet, but they will when they get native context support.

What are the cases where you did run into this, i.e. which application in
userspace hit this? Was it the CTS, some game, something else?
Rob Clark May 15, 2025, 5:36 p.m. UTC | #4
On Thu, May 15, 2025 at 10:23 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 09:15:08AM -0700, Rob Clark wrote:
> > Basically it is a way to throttle userspace to prevent it from OoM'ing
> > itself.  (I suppose userspace could throttle itself, but it doesn't
> > really know how much pre-allocation will need to be done for pgtable
> > updates.)
>
> I assume you mean prevent a single process from OOM'ing itself by queuing up
> VM_BIND requests much faster than they can be completed and hence
> pre-allocations for page tables get out of control?

Yes
Danilo Krummrich May 15, 2025, 6:56 p.m. UTC | #5
On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > (Cc: Boris)
> >
> > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > For some context, other drivers have the concept of a "synchronous"
> > > VM_BIND ioctl which completes immediately, and drivers implement it by
> > > waiting for the whole thing to finish before returning.
> >
> > Nouveau implements sync by issuing a normal async VM_BIND and subsequently
> > waits for the out-fence synchronously.
> 
> As Connor mentioned, we'd prefer it to be async rather than blocking,
> in normal cases, otherwise with drm native context for using native
> UMD in guest VM, you'd be blocking the single host/VMM virglrender
> thread.
> 
> The key is we want to keep it async in the normal cases, and not have
> weird edge case CTS tests blow up from being _too_ async ;-)

I really wonder why they don't blow up in Nouveau, which also supports full
asynchronous VM_BIND. Mind sharing which tests blow up? :)

> > > But this
> > > doesn't work for native context, where everything has to be
> > > asynchronous, so we're trying a new approach where we instead submit
> > > an asynchronous bind for "normal" (non-sparse/driver internal)
> > > allocations and only attach its out-fence to the in-fence of
> > > subsequent submits to other queues.
> >
> > This is what nouveau does and I think other drivers like Xe and panthor do this
> > as well.
> 
> No one has added native context support for these drivers yet

Huh? What exactly do you mean with "native context" then?

> > > Once you do this then you need a
> > > limit like this to prevent memory usage from pending page table
> > > updates from getting out of control. Other drivers haven't needed this
> > > yet, but they will when they get native context support.
> >
> > What are the cases where you did run into this, i.e. which application in
> > userspace hit this? Was it the CTS, some game, something else?
> 
> CTS tests that do weird things with massive # of small bind/unbind.  I
> wouldn't expect to hit the blocking case in the real world.

As mentioned above, can you please share them? I'd like to play around a bit. :)

- Danilo
Danilo Krummrich May 20, 2025, 7:06 a.m. UTC | #6
On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > >
> > > > (Cc: Boris)
> > > >
> > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > VM_BIND ioctl which completes immediately, and drivers implement it by
> > > > > waiting for the whole thing to finish before returning.
> > > >
> > > > Nouveau implements sync by issuing a normal async VM_BIND and subsequently
> > > > waits for the out-fence synchronously.
> > >
> > > As Connor mentioned, we'd prefer it to be async rather than blocking,
> > > in normal cases, otherwise with drm native context for using native
> > > UMD in guest VM, you'd be blocking the single host/VMM virglrender
> > > thread.
> > >
> > > The key is we want to keep it async in the normal cases, and not have
> > > weird edge case CTS tests blow up from being _too_ async ;-)
> >
> > I really wonder why they don't blow up in Nouveau, which also support full
> > asynchronous VM_BIND. Mind sharing which tests blow up? :)
> 
> Maybe it was dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,

The test above is part of the smoke testing I do for nouveau, but I haven't seen
such issues yet for nouveau.

> but I might be mixing that up, I'd have to back out this patch and see
> where things blow up, which would take many hours.

Well, you said that you never had this issue with "real" workloads, but only
with VK CTS, so I really think we should know what we are trying to fix here.

We can't just add new generic infrastructure without reasonable and *well
understood* justification.

> There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> so absolutely throttling like this is needed.

I still don't understand why the kernel must throttle this? If userspace uses
async VM_BIND, it obviously can't spam the kernel infinitely without running
into an OOM case.

But let's assume we agree that we want to avoid that userspace can ever OOM itself
through async VM_BIND, then the proposed solution seems wrong:

Do we really want the driver developer to set an arbitrary boundary of a number
of jobs that can be submitted before *async* VM_BIND blocks and becomes
semi-sync?

How do we choose this number of jobs? A very small number to be safe, which
scales badly on powerful machines? A large number that scales well on powerful
machines, but OOMs on weaker ones?

I really think, this isn't the correct solution, but more a workaround.

> Part of the VM_BIND for msm series adds some tracepoints for amount of
> memory preallocated vs used for each job.  That plus scheduler
> tracepoints should let you see how much memory is tied up in
> prealloc'd pgtables.  You might not be noticing only because you are
> running on a big desktop with lots of RAM ;-)
> 
> > > > > But this
> > > > > doesn't work for native context, where everything has to be
> > > > > asynchronous, so we're trying a new approach where we instead submit
> > > > > an asynchronous bind for "normal" (non-sparse/driver internal)
> > > > > allocations and only attach its out-fence to the in-fence of
> > > > > subsequent submits to other queues.
> > > >
> > > > This is what nouveau does and I think other drivers like Xe and panthor do this
> > > > as well.
> > >
> > > No one has added native context support for these drivers yet
> >
> > Huh? What exactly do you mean with "native context" then?
> 
> It is a way to use native usermode driver in a guest VM, by remoting
> at the UAPI level, as opposed to the vk or gl API level.  You can
> generally get equal to native performance, but the guest/host boundary
> strongly encourages asynchronous to hide the guest->host latency.

For the context we're discussing, this isn't different from other drivers supporting
async VM_BIND utilizing it from the host, rather than from a guest.

So, my original statement about nouveau, Xe, panthor doing the same thing
without running into trouble should be valid.
Rob Clark May 20, 2025, 4:07 p.m. UTC | #7
On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> > On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > >
> > > > > (Cc: Boris)
> > > > >
> > > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > > VM_BIND ioctl which completes immediately, and drivers implement it by
> > > > > > waiting for the whole thing to finish before returning.
> > > > >
> > > > > Nouveau implements sync by issuing a normal async VM_BIND and subsequently
> > > > > waits for the out-fence synchronously.
> > > >
> > > > As Connor mentioned, we'd prefer it to be async rather than blocking,
> > > > in normal cases, otherwise with drm native context for using native
> > > > UMD in guest VM, you'd be blocking the single host/VMM virglrender
> > > > thread.
> > > >
> > > > The key is we want to keep it async in the normal cases, and not have
> > > > weird edge case CTS tests blow up from being _too_ async ;-)
> > >
> > > I really wonder why they don't blow up in Nouveau, which also support full
> > > asynchronous VM_BIND. Mind sharing which tests blow up? :)
> >
> > Maybe it was dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,
>
> The test above is part of the smoke testing I do for nouveau, but I haven't seen
> such issues yet for nouveau.

nouveau is probably not using async binds for everything?  Or maybe
I'm just pointing to the wrong test.

> > but I might be mixing that up, I'd have to back out this patch and see
> > where things blow up, which would take many hours.
>
> Well, you said that you never had this issue with "real" workloads, but only
> with VK CTS, so I really think we should know what we are trying to fix here.
>
> We can't just add new generic infrastructure without reasonable and *well
> understood* justification.

What is not well understood about this?  We need to pre-allocate
memory that we likely don't need for pagetables.

In the worst case, a large # of async PAGE_SIZE binds, you end up
needing to pre-allocate 3 pgtable pages (4 lvl pgtable) per one page
of mapping.  Queue up enough of those and you can explode your memory
usage.
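
(For a concrete sense of scale, assuming 4 KiB pages: the >5k piled-up
VM_BIND jobs mentioned earlier in the thread would, at up to 3 pre-allocated
pgtable pages each, pin roughly 5000 * 3 * 4 KiB ~= 60 MiB of pagetable
memory before any of it is known to be needed.)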

> > There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> > so absolutely throttling like this is needed.
>
> I still don't understand why the kernel must throttle this? If userspace uses
> async VM_BIND, it obviously can't spam the kernel infinitely without running
> into an OOM case.

It is a valid question about whether the kernel or userspace should be
the one to do the throttling.

I went for doing it in the kernel because the kernel has better
knowledge of how much it needs to pre-allocate.

(There is also the side point, that this pre-allocated memory is not
charged to the calling process from a PoV of memory accounting.  So
with that in mind it seems like a good idea for the kernel to throttle
memory usage.)

> But let's assume we agree that we want to avoid that userspace can ever OOM itself
> through async VM_BIND, then the proposed solution seems wrong:
>
> Do we really want the driver developer to set an arbitrary boundary of a number
> of jobs that can be submitted before *async* VM_BIND blocks and becomes
> semi-sync?
>
> How do we choose this number of jobs? A very small number to be safe, which
> scales badly on powerful machines? A large number that scales well on powerful
> machines, but OOMs on weaker ones?

The way I am using it in msm, the credit amount and limit are in units
of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
1024 pages, once there are jobs queued up exceeding that limit, they
start blocking.

The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.

> I really think, this isn't the correct solution, but more a workaround.
>
> > Part of the VM_BIND for msm series adds some tracepoints for amount of
> > memory preallocated vs used for each job.  That plus scheduler
> > tracepoints should let you see how much memory is tied up in
> > prealloc'd pgtables.  You might not be noticing only because you are
> > running on a big desktop with lots of RAM ;-)
> >
> > > > > > But this
> > > > > > doesn't work for native context, where everything has to be
> > > > > > asynchronous, so we're trying a new approach where we instead submit
> > > > > > an asynchronous bind for "normal" (non-sparse/driver internal)
> > > > > > allocations and only attach its out-fence to the in-fence of
> > > > > > subsequent submits to other queues.
> > > > >
> > > > > This is what nouveau does and I think other drivers like Xe and panthor do this
> > > > > as well.
> > > >
> > > > No one has added native context support for these drivers yet
> > >
> > > Huh? What exactly do you mean with "native context" then?
> >
> > It is a way to use native usermode driver in a guest VM, by remoting
> > at the UAPI level, as opposed to the vk or gl API level.  You can
> > generally get equal to native performance, but the guest/host boundary
> > strongly encourages asynchronous to hide the guest->host latency.
>
> For the context we're discussing this isn't different to other drivers supporing
> async VM_BIND utilizing it from the host, rather than from a guest.
>
> So, my original statement about nouveau, Xe, panthor doing the same thing
> without running into trouble should be valid.

Probably the difference is that we don't do any _synchronous_ binds.
And that is partially motivated by the virtual machine case.

BR,
-R
Danilo Krummrich May 20, 2025, 4:54 p.m. UTC | #8
On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> > > On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > >
> > > > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > >
> > > > > > (Cc: Boris)
> > > > > >
> > > > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > > > VM_BIND ioctl which completes immediately, and drivers implement it by
> > > > > > > waiting for the whole thing to finish before returning.
> > > > > >
> > > > > > Nouveau implements sync by issuing a normal async VM_BIND and subsequently
> > > > > > waits for the out-fence synchronously.
> > > > >
> > > > > As Connor mentioned, we'd prefer it to be async rather than blocking,
> > > > > in normal cases, otherwise with drm native context for using native
> > > > > UMD in guest VM, you'd be blocking the single host/VMM virglrender
> > > > > thread.
> > > > >
> > > > > The key is we want to keep it async in the normal cases, and not have
> > > > > weird edge case CTS tests blow up from being _too_ async ;-)
> > > >
> > > > I really wonder why they don't blow up in Nouveau, which also support full
> > > > asynchronous VM_BIND. Mind sharing which tests blow up? :)
> > >
> > > Maybe it was dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,
> >
> > The test above is part of the smoke testing I do for nouveau, but I haven't seen
> > such issues yet for nouveau.
> 
> nouveau is probably not using async binds for everything?  Or maybe
> I'm just pointing to the wrong test.

Let me double check later on.

> > > but I might be mixing that up, I'd have to back out this patch and see
> > > where things blow up, which would take many hours.
> >
> > Well, you said that you never had this issue with "real" workloads, but only
> > with VK CTS, so I really think we should know what we are trying to fix here.
> >
> > We can't just add new generic infrastructure without reasonable and *well
> > understood* justification.
> 
> What is not well understood about this?  We need to pre-allocate
> memory that we likely don't need for pagetables.
> 
> In the worst case, a large # of async PAGE_SIZE binds, you end up
> needing to pre-allocate 3 pgtable pages (4 lvl pgtable) per one page
> of mapping.  Queue up enough of those and you can explode your memory
> usage.

Well, the general principle of how this can OOM is well understood, sure. What's
not well understood is how we run into this case. I think we should also
understand what test causes the issue and why other drivers are not affected
(yet).

> > > There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> > > so absolutely throttling like this is needed.
> >
> > I still don't understand why the kernel must throttle this? If userspace uses
> > async VM_BIND, it obviously can't spam the kernel infinitely without running
> > into an OOM case.
> 
> It is a valid question about whether the kernel or userspace should be
> the one to do the throttling.
> 
> I went for doing it in the kernel because the kernel has better
> knowledge of how much it needs to pre-allocate.
> 
> (There is also the side point, that this pre-allocated memory is not
> charged to the calling process from a PoV of memory accounting.  So
> with that in mind it seems like a good idea for the kernel to throttle
> memory usage.)

That's a very valid point, maybe we should investigate in the direction of
addressing this, rather than trying to work around it in the scheduler, where we
can only set an arbitrary credit limit.

> > But let's assume we agree that we want to avoid that userspace can ever OOM itself
> > through async VM_BIND, then the proposed solution seems wrong:
> >
> > Do we really want the driver developer to set an arbitrary boundary of a number
> > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > semi-sync?
> >
> > How do we choose this number of jobs? A very small number to be safe, which
> > scales badly on powerful machines? A large number that scales well on powerful
> > machines, but OOMs on weaker ones?
> 
> The way I am using it in msm, the credit amount and limit are in units
> of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> 1024 pages, once there are jobs queued up exceeding that limit, they
> start blocking.
> 
> The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.

That doesn't make a difference for my question. How do you know 1024 pages is a
good value? How do we scale for different machines with different capabilities?

If you have a powerful machine with lots of memory, we might throttle userspace
for no reason, no?

If the machine has very limited resources, it might already be too much?
Connor Abbott May 20, 2025, 5:05 p.m. UTC | #9
On Tue, May 20, 2025 at 12:54 PM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> > > > On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > >
> > > > > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > > > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > > >
> > > > > > > (Cc: Boris)
> > > > > > >
> > > > > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > > > > VM_BIND ioctl which completes immediately, and drivers implement it by
> > > > > > > > waiting for the whole thing to finish before returning.
> > > > > > >
> > > > > > > Nouveau implements sync by issuing a normal async VM_BIND and subsequently
> > > > > > > waits for the out-fence synchronously.
> > > > > >
> > > > > > As Connor mentioned, we'd prefer it to be async rather than blocking,
> > > > > > in normal cases, otherwise with drm native context for using native
> > > > > > UMD in guest VM, you'd be blocking the single host/VMM virglrender
> > > > > > thread.
> > > > > >
> > > > > > The key is we want to keep it async in the normal cases, and not have
> > > > > > weird edge case CTS tests blow up from being _too_ async ;-)
> > > > >
> > > > > I really wonder why they don't blow up in Nouveau, which also support full
> > > > > asynchronous VM_BIND. Mind sharing which tests blow up? :)
> > > >
> > > > Maybe it was dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,
> > >
> > > The test above is part of the smoke testing I do for nouveau, but I haven't seen
> > > such issues yet for nouveau.
> >
> > nouveau is probably not using async binds for everything?  Or maybe
> > I'm just pointing to the wrong test.
>
> Let me double check later on.
>
> > > > but I might be mixing that up, I'd have to back out this patch and see
> > > > where things blow up, which would take many hours.
> > >
> > > Well, you said that you never had this issue with "real" workloads, but only
> > > with VK CTS, so I really think we should know what we are trying to fix here.
> > >
> > > We can't just add new generic infrastructure without reasonable and *well
> > > understood* justification.
> >
> > What is not well understood about this?  We need to pre-allocate
> > memory that we likely don't need for pagetables.
> >
> > In the worst case, a large # of async PAGE_SIZE binds, you end up
> > needing to pre-allocate 3 pgtable pages (4 lvl pgtable) per one page
> > of mapping.  Queue up enough of those and you can explode your memory
> > usage.
>
> Well, the general principle how this can OOM is well understood, sure. What's
> not well understood is how we run in this case. I think we should also
> understand what test causes the issue and why other drivers are not affected
> (yet).

Once again, it's well understood why other drivers aren't affected.
They have both synchronous and asynchronous VM_BINDs in the uabi, and
the userspace driver uses synchronous VM_BIND for everything except
sparse mappings. For freedreno we tried to change that because async
works better for native context, and that exposed the pre-existing issue
of async VM_BINDs hanging the whole system when we run out of memory,
since more mappings started being async.

I think it would be possible in theory for other drivers to forward
synchronous VM_BINDs asynchronously to the host as long as the host
kernel executes them synchronously, so maybe other drivers won't have
a problem with native context support. But it will still be possible
to make them fall over if you poke them the right way.

Connor

>
> > > > There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> > > > so absolutely throttling like this is needed.
> > >
> > > I still don't understand why the kernel must throttle this? If userspace uses
> > > async VM_BIND, it obviously can't spam the kernel infinitely without running
> > > into an OOM case.
> >
> > It is a valid question about whether the kernel or userspace should be
> > the one to do the throttling.
> >
> > I went for doing it in the kernel because the kernel has better
> > knowledge of how much it needs to pre-allocate.
> >
> > (There is also the side point, that this pre-allocated memory is not
> > charged to the calling process from a PoV of memory accounting.  So
> > with that in mind it seems like a good idea for the kernel to throttle
> > memory usage.)
>
> That's a very valid point, maybe we should investigate in the direction of
> addressing this, rather than trying to work around it in the scheduler, where we
> can only set an arbitrary credit limit.
>
> > > But let's assume we agree that we want to avoid that userspace can ever OOM itself
> > > through async VM_BIND, then the proposed solution seems wrong:
> > >
> > > Do we really want the driver developer to set an arbitrary boundary of a number
> > > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > > semi-sync?
> > >
> > > How do we choose this number of jobs? A very small number to be safe, which
> > > scales badly on powerful machines? A large number that scales well on powerful
> > > machines, but OOMs on weaker ones?
> >
> > The way I am using it in msm, the credit amount and limit are in units
> > of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> > 1024 pages, once there are jobs queued up exceeding that limit, they
> > start blocking.
> >
> > The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.
>
> That doesn't make a difference for my question. How do you know 1024 pages is a
> good value? How do we scale for different machines with different capabilities?
>
> If you have a powerful machine with lots of memory, we might throttle userspace
> for no reason, no?
>
> If the machine has very limited resources, it might already be too much?
Rob Clark May 20, 2025, 5:22 p.m. UTC | #10
On Tue, May 20, 2025 at 9:54 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 12:56:38PM -0700, Rob Clark wrote:
> > > > On Thu, May 15, 2025 at 11:56 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > >
> > > > > On Thu, May 15, 2025 at 10:40:15AM -0700, Rob Clark wrote:
> > > > > > On Thu, May 15, 2025 at 10:30 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > > >
> > > > > > > (Cc: Boris)
> > > > > > >
> > > > > > > On Thu, May 15, 2025 at 12:22:18PM -0400, Connor Abbott wrote:
> > > > > > > > For some context, other drivers have the concept of a "synchronous"
> > > > > > > > VM_BIND ioctl which completes immediately, and drivers implement it by
> > > > > > > > waiting for the whole thing to finish before returning.
> > > > > > >
> > > > > > > Nouveau implements sync by issuing a normal async VM_BIND and subsequently
> > > > > > > waits for the out-fence synchronously.
> > > > > >
> > > > > > As Connor mentioned, we'd prefer it to be async rather than blocking,
> > > > > > in normal cases, otherwise with drm native context for using native
> > > > > > UMD in guest VM, you'd be blocking the single host/VMM virglrender
> > > > > > thread.
> > > > > >
> > > > > > The key is we want to keep it async in the normal cases, and not have
> > > > > > weird edge case CTS tests blow up from being _too_ async ;-)
> > > > >
> > > > > I really wonder why they don't blow up in Nouveau, which also support full
> > > > > asynchronous VM_BIND. Mind sharing which tests blow up? :)
> > > >
> > > > Maybe it was dEQP-VK.sparse_resources.buffer.ssbo.sparse_residency.buffer_size_2_24,
> > >
> > > The test above is part of the smoke testing I do for nouveau, but I haven't seen
> > > such issues yet for nouveau.
> >
> > nouveau is probably not using async binds for everything?  Or maybe
> > I'm just pointing to the wrong test.
>
> Let me double check later on.
>
> > > > but I might be mixing that up, I'd have to back out this patch and see
> > > > where things blow up, which would take many hours.
> > >
> > > Well, you said that you never had this issue with "real" workloads, but only
> > > with VK CTS, so I really think we should know what we are trying to fix here.
> > >
> > > We can't just add new generic infrastructure without reasonable and *well
> > > understood* justification.
> >
> > What is not well understood about this?  We need to pre-allocate
> > memory that we likely don't need for pagetables.
> >
> > In the worst case, a large # of async PAGE_SIZE binds, you end up
> > needing to pre-allocate 3 pgtable pages (4 lvl pgtable) per one page
> > of mapping.  Queue up enough of those and you can explode your memory
> > usage.
>
> Well, the general principle how this can OOM is well understood, sure. What's
> not well understood is how we run in this case. I think we should also
> understand what test causes the issue and why other drivers are not affected
> (yet).
>
> > > > There definitely was one where I was seeing >5k VM_BIND jobs pile up,
> > > > so absolutely throttling like this is needed.
> > >
> > > I still don't understand why the kernel must throttle this? If userspace uses
> > > async VM_BIND, it obviously can't spam the kernel infinitely without running
> > > into an OOM case.
> >
> > It is a valid question about whether the kernel or userspace should be
> > the one to do the throttling.
> >
> > I went for doing it in the kernel because the kernel has better
> > knowledge of how much it needs to pre-allocate.
> >
> > (There is also the side point, that this pre-allocated memory is not
> > charged to the calling process from a PoV of memory accounting.  So
> > with that in mind it seems like a good idea for the kernel to throttle
> > memory usage.)
>
> That's a very valid point, maybe we should investigate in the direction of
> addressing this, rather than trying to work around it in the scheduler, where we
> can only set an arbitrary credit limit.

Perhaps.. but that seems like a bigger can of worms

> > > But let's assume we agree that we want to avoid that userspace can ever OOM itself
> > > through async VM_BIND, then the proposed solution seems wrong:
> > >
> > > Do we really want the driver developer to set an arbitrary boundary of a number
> > > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > > semi-sync?
> > >
> > > How do we choose this number of jobs? A very small number to be safe, which
> > > scales badly on powerful machines? A large number that scales well on powerful
> > > machines, but OOMs on weaker ones?
> >
> > The way I am using it in msm, the credit amount and limit are in units
> > of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> > 1024 pages, once there are jobs queued up exceeding that limit, they
> > start blocking.
> >
> > The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.
>
> That doesn't make a difference for my question. How do you know 1024 pages is a
> good value? How do we scale for different machines with different capabilities?
>
> If you have a powerful machine with lots of memory, we might throttle userspace
> for no reason, no?
>
> If the machine has very limited resources, it might already be too much?

It may be a bit arbitrary, but then again I'm not sure that userspace
is in any better position to pick an appropriate limit.

4MB of in-flight pages isn't going to be too much for anything that is
capable enough to run vk, but still allows for a lot of in-flight
maps.  As I mentioned before, I don't expect anyone to hit this case
normally, unless they are just trying to poke the driver in weird
ways.  Having the kernel guard against that doesn't seem unreasonable.

BR,
-R
Danilo Krummrich May 22, 2025, 11 a.m. UTC | #11
On Tue, May 20, 2025 at 10:22:54AM -0700, Rob Clark wrote:
> On Tue, May 20, 2025 at 9:54 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > But let's assume we agree that we want to avoid that userspace can ever OOM itself
> > > > through async VM_BIND, then the proposed solution seems wrong:
> > > >
> > > > Do we really want the driver developer to set an arbitrary boundary of a number
> > > > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > > > semi-sync?
> > > >
> > > > How do we choose this number of jobs? A very small number to be safe, which
> > > > scales badly on powerful machines? A large number that scales well on powerful
> > > > machines, but OOMs on weaker ones?
> > >
> > > The way I am using it in msm, the credit amount and limit are in units
> > > of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> > > 1024 pages, once there are jobs queued up exceeding that limit, they
> > > start blocking.
> > >
> > > The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.
> >
> > That doesn't make a difference for my question. How do you know 1024 pages is a
> > good value? How do we scale for different machines with different capabilities?
> >
> > If you have a powerful machine with lots of memory, we might throttle userspace
> > for no reason, no?
> >
> > If the machine has very limited resources, it might already be too much?
> 
> It may be a bit arbitrary, but then again I'm not sure that userspace
> is in any better position to pick an appropriate limit.
> 
> 4MB of in-flight pages isn't going to be too much for anything that is
> capable enough to run vk, but still allows for a lot of in-flight
> maps.

Ok, but what about the other way around? What's the performance impact if the
limit is chosen rather small, but we're running on a very powerful machine?

Since you already have the implementation for hardware you have access to, can
you please check if and how performance degrades when you use a very small
threshold?

Also, I think we should probably put this throttle mechanism in a separate
component that just wraps a counter of bytes (or rather pages), which can be
increased and decreased through an API, with the increase blocking at a certain
threshold.

This component can then be called by a driver from the job submit IOCTL and the
corresponding place where the pre-allocated memory is actually used / freed.

Depending on the driver, this might not necessarily be in the scheduler's
run_job() callback.

We could call the component something like drm_throttle or drm_submit_throttle.
Rob Clark May 22, 2025, 2:47 p.m. UTC | #12
On Thu, May 22, 2025 at 4:00 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Tue, May 20, 2025 at 10:22:54AM -0700, Rob Clark wrote:
> > On Tue, May 20, 2025 at 9:54 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > > > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > But let's assume we agree that we want to avoid that userspace can ever OOM itself
> > > > > through async VM_BIND, then the proposed solution seems wrong:
> > > > >
> > > > > Do we really want the driver developer to set an arbitrary boundary of a number
> > > > > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > > > > semi-sync?
> > > > >
> > > > > How do we choose this number of jobs? A very small number to be safe, which
> > > > > scales badly on powerful machines? A large number that scales well on powerful
> > > > > machines, but OOMs on weaker ones?
> > > >
> > > > The way I am using it in msm, the credit amount and limit are in units
> > > > of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> > > > 1024 pages, once there are jobs queued up exceeding that limit, they
> > > > start blocking.
> > > >
> > > > The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.
> > >
> > > That doesn't make a difference for my question. How do you know 1024 pages is a
> > > good value? How do we scale for different machines with different capabilities?
> > >
> > > If you have a powerful machine with lots of memory, we might throttle userspace
> > > for no reason, no?
> > >
> > > If the machine has very limited resources, it might already be too much?
> >
> > It may be a bit arbitrary, but then again I'm not sure that userspace
> > is in any better position to pick an appropriate limit.
> >
> > 4MB of in-flight pages isn't going to be too much for anything that is
> > capable enough to run vk, but still allows for a lot of in-flight
> > maps.
>
> Ok, but what about the other way around? What's the performance impact if the
> limit is chosen rather small, but we're running on a very powerful machine?
>
> Since you already have the implementation for hardware you have access to, can
> you please check if and how performance degrades when you use a very small
> threshold?

I mean, considering that some drivers (asahi, at least), _only_
implement synchronous VM_BIND, I guess blocking in extreme cases isn't
so bad.  But I think you are overthinking this.  4MB of pagetables is
enough to map ~8GB of buffers.

Perhaps drivers would want to set their limit based on the amount of
memory the GPU could map, which might land them on a # larger than
1024, but still not an order of magnitude more.

I don't really have a good setup for testing games that use this, atm,
fex-emu isn't working for me atm.  But I think Connor has a setup with
proton working?

But, flip it around.  It is pretty simple to create a test program
that submits a flood of 4k (or whatever your min page size is)
VM_BINDs, and see how prealloc memory usage blows up.  This is really
the thing this patch is trying to protect against.

> Also, I think we should probably put this throttle mechanism in a separate
> component, that just wraps a counter of bytes or rather pages that can be
> increased and decreased through an API and the increase just blocks at a certain
> threshold.

Maybe?  I don't see why we need to explicitly define the units for the
credit.  This wasn't done for the existing credit mechanism... which, it
seems, could also have been implemented externally if you used some extra
fences.

> This component can then be called by a driver from the job submit IOCTL and the
> corresponding place where the pre-allocated memory is actually used / freed.
>
> Depending on the driver, this might not necessarily be in the scheduler's
> run_job() callback.
>
> We could call the component something like drm_throttle or drm_submit_throttle.

Maybe?  This still has the same complaint I had about just
implementing this in msm.. it would have to reach in and use the
scheduler's job_scheduled wait-queue.  Which, to me at least, seems
like more of an internal detail about how the scheduler works.

BR,
-R
Danilo Krummrich May 22, 2025, 3:53 p.m. UTC | #13
On Thu, May 22, 2025 at 07:47:17AM -0700, Rob Clark wrote:
> On Thu, May 22, 2025 at 4:00 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > On Tue, May 20, 2025 at 10:22:54AM -0700, Rob Clark wrote:
> > > On Tue, May 20, 2025 at 9:54 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > > > > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > > But let's assume we agree that we want to avoid that userspace can ever OOM itself
> > > > > > through async VM_BIND, then the proposed solution seems wrong:
> > > > > >
> > > > > > Do we really want the driver developer to set an arbitrary boundary of a number
> > > > > > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > > > > > semi-sync?
> > > > > >
> > > > > > How do we choose this number of jobs? A very small number to be safe, which
> > > > > > scales badly on powerful machines? A large number that scales well on powerful
> > > > > > machines, but OOMs on weaker ones?
> > > > >
> > > > > The way I am using it in msm, the credit amount and limit are in units
> > > > > of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> > > > > 1024 pages, once there are jobs queued up exceeding that limit, they
> > > > > start blocking.
> > > > >
> > > > > The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.
> > > >
> > > > That doesn't make a difference for my question. How do you know 1024 pages is a
> > > > good value? How do we scale for different machines with different capabilities?
> > > >
> > > > If you have a powerful machine with lots of memory, we might throttle userspace
> > > > for no reason, no?
> > > >
> > > > If the machine has very limited resources, it might already be too much?
> > >
> > > It may be a bit arbitrary, but then again I'm not sure that userspace
> > > is in any better position to pick an appropriate limit.
> > >
> > > 4MB of in-flight pages isn't going to be too much for anything that is
> > > capable enough to run vk, but still allows for a lot of in-flight
> > > maps.
> >
> > Ok, but what about the other way around? What's the performance impact if the
> > limit is chosen rather small, but we're running on a very powerful machine?
> >
> > Since you already have the implementation for hardware you have access to, can
> > you please check if and how performance degrades when you use a very small
> > threshold?
> 
> I mean, considering that some drivers (asahi, at least), _only_
> implement synchronous VM_BIND, I guess blocking in extreme cases isn't
> so bad.

Which is not even upstream yet and eventually will support async VM_BIND too,
AFAIK.

> But I think you are overthinking this.  4MB of pagetables is
> enough to map ~8GB of buffers.
> 
> Perhaps drivers would want to set their limit based on the amount of
> memory the GPU could map, which might land them on a # larger than
> 1024, but still not an order of magnitude more.

Nouveau currently supports an address space width of 128TiB.

In general, we have to cover the range of some small laptop or handheld devices
to huge datacenter machines.

> I don't really have a good setup for testing games that use this, atm,
> fex-emu isn't working for me atm.  But I think Connor has a setup with
> proton working?

I just want to be sure that an arbitrarily small limit, chosen so that a small
device doesn't fail VK CTS, can't regress performance on large machines.

So, kindly try to prove that we're not prone to extreme performance regression
with a static value as you propose.

> > Also, I think we should probably put this throttle mechanism in a separate
> > component, that just wraps a counter of bytes or rather pages that can be
> > increased and decreased through an API and the increase just blocks at a certain
> > threshold.
> 
> Maybe?  I don't see why we need to explicitly define the units for the
> credit.  This wasn't done for the existing credit mechanism.. which,
> seems like if you used some extra fences could also have been
> implemented externally.

If you are referring to the credit mechanism in the scheduler for ring buffers,
that's a different case. Drivers know the size of their ring buffers exactly and
the scheduler has the responsibility of when to submit tasks to the ring buffer.
So the scheduler kind of owns the resource.

However, the throttle mechanism you propose is independent of the scheduler;
it depends on available system memory, a resource the scheduler doesn't own.

I'm fine to make the unit credits as well, but in this case we really care about
the consumption of system memory, so we could just use an applicable unit.

> > This component can then be called by a driver from the job submit IOCTL and the
> > corresponding place where the pre-allocated memory is actually used / freed.
> >
> > Depending on the driver, this might not necessarily be in the scheduler's
> > run_job() callback.
> >
> > We could call the component something like drm_throttle or drm_submit_throttle.
> 
> Maybe?  This still has the same complaint I had about just
> implementing this in msm.. it would have to reach in and use the
> scheduler's job_scheduled wait-queue.  Which, to me at least, seems
> like more of an internal detail about how the scheduler works.

Why? The component should use its own waitqueue. Subsequently, from your code
that releases the pre-allocated memory, you can decrement the counter through
the drm_throttle API, which automatically kicks its waitqueue.

For instance from your VM_BIND IOCTL you can call

	drm_throttle_inc(value)

which blocks if the increment goes above the threshold. And when you release the
pre-allocated memory you call

	drm_throttle_dec(value)

which wakes the waitqueue and unblocks the drm_throttle_inc() call from your
VM_BIND IOCTL.

Another advantage is that, if necessary, we can make drm_throttle
(automatically) scale for the machine's resources, which otherwise we'd need to
pollute the scheduler with.
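
A minimal sketch of what such a drm_throttle component could look like, assuming
the drm_throttle_inc()/drm_throttle_dec() names proposed above; this helper does
not exist upstream, so the structure and functions below are purely illustrative:

#include <linux/atomic.h>
#include <linux/wait.h>

struct drm_throttle {
	wait_queue_head_t wq;
	atomic_t count;
	unsigned int limit;
};

void drm_throttle_init(struct drm_throttle *t, unsigned int limit)
{
	init_waitqueue_head(&t->wq);
	atomic_set(&t->count, 0);
	t->limit = limit;
}

/*
 * Charge @value units (e.g. pre-allocated pgtable pages), blocking
 * interruptibly while the total would exceed the limit.  Like the
 * scheduler patch below, this assumes submitters are serialized (e.g.
 * per VM); otherwise the check-then-add is racy.
 */
int drm_throttle_inc(struct drm_throttle *t, unsigned int value)
{
	int ret;

	ret = wait_event_interruptible(t->wq,
			atomic_read(&t->count) + value <= t->limit);
	if (ret)
		return ret;

	atomic_add(value, &t->count);
	return 0;
}

/* Release @value units once the pre-allocated memory is consumed or freed. */
void drm_throttle_dec(struct drm_throttle *t, unsigned int value)
{
	atomic_sub(value, &t->count);
	wake_up_all(&t->wq);
}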
Rob Clark May 23, 2025, 2:31 a.m. UTC | #14
On Thu, May 22, 2025 at 8:53 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 22, 2025 at 07:47:17AM -0700, Rob Clark wrote:
> > On Thu, May 22, 2025 at 4:00 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > On Tue, May 20, 2025 at 10:22:54AM -0700, Rob Clark wrote:
> > > > On Tue, May 20, 2025 at 9:54 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > On Tue, May 20, 2025 at 09:07:05AM -0700, Rob Clark wrote:
> > > > > > On Tue, May 20, 2025 at 12:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > > > But let's assume we agree that we want to avoid that userspace can ever OOM itself
> > > > > > > through async VM_BIND, then the proposed solution seems wrong:
> > > > > > >
> > > > > > > Do we really want the driver developer to set an arbitrary boundary of a number
> > > > > > > of jobs that can be submitted before *async* VM_BIND blocks and becomes
> > > > > > > semi-sync?
> > > > > > >
> > > > > > > How do we choose this number of jobs? A very small number to be safe, which
> > > > > > > scales badly on powerful machines? A large number that scales well on powerful
> > > > > > > machines, but OOMs on weaker ones?
> > > > > >
> > > > > > The way I am using it in msm, the credit amount and limit are in units
> > > > > > of pre-allocated pages in-flight.  I set the enqueue_credit_limit to
> > > > > > 1024 pages, once there are jobs queued up exceeding that limit, they
> > > > > > start blocking.
> > > > > >
> > > > > > The number of _jobs_ is irrelevant, it is # of pre-alloc'd pages in flight.
> > > > >
> > > > > That doesn't make a difference for my question. How do you know 1024 pages is a
> > > > > good value? How do we scale for different machines with different capabilities?
> > > > >
> > > > > If you have a powerful machine with lots of memory, we might throttle userspace
> > > > > for no reason, no?
> > > > >
> > > > > If the machine has very limited resources, it might already be too much?
> > > >
> > > > It may be a bit arbitrary, but then again I'm not sure that userspace
> > > > is in any better position to pick an appropriate limit.
> > > >
> > > > 4MB of in-flight pages isn't going to be too much for anything that is
> > > > capable enough to run vk, but still allows for a lot of in-flight
> > > > maps.
> > >
> > > Ok, but what about the other way around? What's the performance impact if the
> > > limit is chosen rather small, but we're running on a very powerful machine?
> > >
> > > Since you already have the implementation for hardware you have access to, can
> > > you please check if and how performance degrades when you use a very small
> > > threshold?
> >
> > I mean, considering that some drivers (asahi, at least), _only_
> > implement synchronous VM_BIND, I guess blocking in extreme cases isn't
> > so bad.
>
> Which is not even upstream yet and eventually will support async VM_BIND too,
> AFAIK.

the uapi is upstream

> > But I think you are overthinking this.  4MB of pagetables is
> > enough to map ~8GB of buffers.
> >
> > Perhaps drivers would want to set their limit based on the amount of
> > memory the GPU could map, which might land them on a # larger than
> > 1024, but still not an order of magnitude more.
>
> Nouveau currently supports an address space width of 128TiB.
>
> In general, we have to cover the range from small laptop or handheld devices
> to huge datacenter machines.

sure.. and?  It is still up to the user of sched to set their own
limits, I'm not proposing that sched takes charge of that policy

Maybe msm doesn't have to scale up quite as much (yet).. but it has to
scale quite a bit further down (like watches).  In the end it is the
same.  And also not really the point here.

> > I don't really have a good setup for testing games that use this, atm,
> > fex-emu isn't working for me atm.  But I think Connor has a setup with
> > proton working?
>
> I just want to be sure that an arbitrarily small limit, chosen so that a small
> device doesn't fail VK CTS, can't regress performance on large machines.

why are we debating the limit I set outside of sched.. even that might
be subject to some tuning for devices that have more memory, but that is
really outside the scope of this patch

> So, kindly try to prove that we're not prone to extreme performance regression
> with a static value as you propose.
>
> > > Also, I think we should probably put this throttle mechanism in a separate
> > > component, that just wraps a counter of bytes or rather pages that can be
> > > increased and decreased through an API and the increase just blocks at a certain
> > > threshold.
> >
> > Maybe?  I don't see why we need to explicitly define the units for the
> > credit.  This wasn't done for the existing credit mechanism.. which, it
> > seems, could also have been implemented externally if you used some extra
> > fences.
>
> If you are referring to the credit mechanism in the scheduler for ring buffers,
> that's a different case. Drivers know the size of their ring buffers exactly and
> the scheduler has the responsibility of when to submit tasks to the ring buffer.
> So the scheduler kind of owns the resource.
>
> However, the throttle mechanism you propose is independent of the scheduler;
> it depends on the available system memory, a resource the scheduler doesn't own.

it is a distinction that is perhaps a matter of opinion.  I don't see
such a big difference, it is all just a matter of managing physical
resource usage in different stages of a scheduled job's lifetime.

> I'm fine to make the unit credits as well, but in this case we really care about
> the consumption of system memory, so we could just use an applicable unit.
>
> > > This component can then be called by a driver from the job submit IOCTL and the
> > > corresponding place where the pre-allocated memory is actually used / freed.
> > >
> > > Depending on the driver, this might not necessarily be in the scheduler's
> > > run_job() callback.
> > >
> > > We could call the component something like drm_throttle or drm_submit_throttle.
> >
> > Maybe?  This still has the same complaint I had about just
> > implementing this in msm.. it would have to reach in and use the
> > scheduler's job_scheduled wait-queue.  Which, to me at least, seems
> > like more of an internal detail about how the scheduler works.
>
> Why? The component should use its own waitqueue. Subsequently, from your code
> that releases the pre-allocated memory, you can decrement the counter through
> the drm_throttle API, which automatically kicks its waitqueue.
>
> For instance from your VM_BIND IOCTL you can call
>
>         drm_throttle_inc(value)
>
> which blocks if the increment goes above the threshold. And when you release the
> pre-allocated memory you call
>
>         drm_throttle_dec(value)
>
> which wakes the waitqueue and unblocks the drm_throttle_inc() call from your
> VM_BIND IOCTL.

ok, sure, we could introduce another waitqueue, but with my proposal
that is not needed.  And like I said, the existing throttling could
also be implemented externally to the scheduler..  so I'm not seeing
any fundamental difference.

> Another advantage is that, if necessary, we can make drm_throttle
> (automatically) scale for the machine's resources, which otherwise we'd need to
> pollute the scheduler with.

How is this different from drivers being more sophisticated about
picking the limit we configure the scheduler with?

Sure, maybe just setting a hard coded limit of 1024 might not be the
final solution.. maybe we should take into consideration the size of
the device.  But this is also entirely outside of the scheduler and I
fail to understand why we are discussing this here?

BR,
-R
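
Purely as an illustration of the "more sophisticated" limit selection mentioned
above (nothing in this series does this), a driver could scale the enqueue
credit limit with the amount of system memory, e.g.:

#include <linux/minmax.h>
#include <linux/mm.h>

/*
 * Hypothetical heuristic: allow roughly 0.1% of system RAM worth of
 * pre-allocated pgtable pages in flight, with the 1024-page value used
 * by msm in this series as a floor and an arbitrary 2^20-page ceiling.
 */
static u32 example_enqueue_credit_limit(void)
{
	unsigned long pages = totalram_pages() / 1024;

	return clamp_t(unsigned long, pages, 1024, 1 << 20);
}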
Danilo Krummrich May 23, 2025, 6:58 a.m. UTC | #15
On Thu, May 22, 2025 at 07:31:28PM -0700, Rob Clark wrote:
> On Thu, May 22, 2025 at 8:53 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > On Thu, May 22, 2025 at 07:47:17AM -0700, Rob Clark wrote:
> > > On Thu, May 22, 2025 at 4:00 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > Ok, but what about the other way around? What's the performance impact if the
> > > > limit is chosen rather small, but we're running on a very powerful machine?
> > > >
> > > > Since you already have the implementation for hardware you have access to, can
> > > > you please check if and how performance degrades when you use a very small
> > > > threshold?
> > >
> > > I mean, considering that some drivers (asahi, at least), _only_
> > > implement synchronous VM_BIND, I guess blocking in extreme cases isn't
> > > so bad.
> >
> > Which is not even upstream yet and eventually will support async VM_BIND too,
> > AFAIK.
> 
> the uapi is upstream

And will be extended once they have the corresponding async implementation in
the driver.

> > > But I think you are overthinking this.  4MB of pagetables is
> > > enough to map ~8GB of buffers.
> > >
> > > Perhaps drivers would want to set their limit based on the amount of
> > > memory the GPU could map, which might land them on a # larger than
> > > 1024, but still not an order of magnitude more.
> >
> > Nouveau currently supports an address space width of 128TiB.
> >
> > In general, we have to cover the range from small laptop or handheld devices
> > to huge datacenter machines.
> 
> sure.. and?  It is still up to the user of sched to set their own
> limits, I'm not proposing that sched takes charge of that policy
> 
> Maybe msm doesn't have to scale up quite as much (yet).. but it has to
> scale quite a bit further down (like watches).  In the end it is the
> same.  And also not really the point here.
> 
> > > I don't really have a good setup for testing games that use this, atm,
> > > fex-emu isn't working for me atm.  But I think Connor has a setup with
> > > proton working?
> >
> > I just want to be sure that an arbitrarily small limit, chosen so that a small
> > device doesn't fail VK CTS, can't regress performance on large machines.
> 
> why are we debating the limit I set outside of sched.. even that might
> be subject to some tuning for devices that have more memory, but that is
> really outside the scope of this patch

We are not debating the number you set in MSM, we're talking about whether a
statically set number will be sufficient.

Also, do we really want it to be our quality standard that we introduce some
throttling mechanism as generic infrastructure for drivers and don't even add a
comment guiding drivers on how to choose a proper limit and what the potential
pitfalls in choosing it are?

When working on a driver, do you want to run into APIs that don't give you
proper guidance on how to use them correctly?

I think it would not be very nice to tell drivers, "Look, here's a throttling API
for when VK CTS (unknown test) ruins your day. We also can't give any advice on
the limit that should be set depending on the scale of the machine, since we
never looked into it."

> > So, kindly try to prove that we're not prone to extreme performance regression
> > with a static value as you propose.
> >
> > > > Also, I think we should probably put this throttle mechanism in a separate
> > > > component, that just wraps a counter of bytes or rather pages that can be
> > > > increased and decreased through an API and the increase just blocks at a certain
> > > > threshold.
> > >
> > > Maybe?  I don't see why we need to explicitly define the units for the
> > > credit.  This wasn't done for the existing credit mechanism.. which, it
> > > seems, could also have been implemented externally if you used some extra
> > > fences.
> >
> > If you are referring to the credit mechanism in the scheduler for ring buffers,
> > that's a different case. Drivers know the size of their ring buffers exactly and
> > the scheduler has the responsibility of when to submit tasks to the ring buffer.
> > So the scheduler kind of owns the resource.
> >
> > However, the throttle mechanism you propose is independent of the scheduler;
> > it depends on the available system memory, a resource the scheduler doesn't own.
> 
> it is a distinction that is perhaps a matter of opinion.  I don't see
> such a big difference, it is all just a matter of managing physical
> resource usage in different stages of a scheduled job's lifetime.

Yes, but the ring buffer as a resource is owned by the scheduler, and hence
having the scheduler care about flow control makes sense.

Here you want to flow control the uAPI (i.e. VM_BIND ioctl) -- let's do this in
a separate component, please.

> > > Maybe?  This still has the same complaint I had about just
> > > implementing this in msm.. it would have to reach in and use the
> > > scheduler's job_scheduled wait-queue.  Which, to me at least, seems
> > > like more of an internal detail about how the scheduler works.
> >
> > Why? The component should use its own waitqueue. Subsequently, from your code
> > that releases the pre-allocated memory, you can decrement the counter through
> > the drm_throttle API, which automatically kicks its waitqueue.
> >
> > For instance from your VM_BIND IOCTL you can call
> >
> >         drm_throttle_inc(value)
> >
> > which blocks if the increment goes above the threshold. And when you release the
> > pre-allocated memory you call
> >
> >         drm_throttle_dec(value)
> >
> > which wakes the waitqueue and unblocks the drm_throttle_inc() call from your
> > VM_BIND IOCTL.
> 
> ok, sure, we could introduce another waitqueue, but with my proposal
> that is not needed.  And like I said, the existing throttling could
> also be implemented externally to the scheduler..  so I'm not seeing
> any fundamental difference.

Yes, but you also implicitly force drivers to actually release the pre-allocated
memory before the scheduler's internal waitqueue is woken. Having such implicit
rules isn't nice.

Also, with that drivers would need to do so in run_job(), i.e. in the fence
signalling critical path, which some drivers may not be able to do.

And, it also adds complexity to the scheduler, which we're trying to reduce.

All this goes away with making this a separate component -- please do that
instead.
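
For reference, an illustrative sketch of how a driver's VM_BIND path might use
the enqueue-credit fields added by the patch below; the my_* names are
hypothetical, while the drm_sched_* functions and fields are those of the
existing API plus this patch:

#include <drm/gpu_scheduler.h>

struct my_gpu_vm {
	struct drm_gpu_scheduler sched;
	/* ... */
};

struct my_vm_bind_job {
	struct drm_sched_job base;
	unsigned int nr_prealloc_pages;	/* pgtable pages pre-allocated for this bind */
	/* ... */
};

/* At scheduler creation: bound the pre-allocated pages that may be in flight. */
static int my_vm_sched_init(struct my_gpu_vm *vm)
{
	const struct drm_sched_init_args args = {
		.ops = &my_vm_bind_sched_ops,
		.num_rqs = 1,
		.credit_limit = 1,		/* ring credits, unrelated to enqueue credits */
		.enqueue_credit_limit = 1024,	/* pages, as chosen for msm */
		.timeout = MAX_SCHEDULE_TIMEOUT,
		.name = "my-vm-bind",
	};

	return drm_sched_init(&vm->sched, &args);
}

/* At submit: charge the job with the pages pre-allocated for its updates. */
static int my_vm_bind_push(struct my_vm_bind_job *job)
{
	job->base.enqueue_credits = job->nr_prealloc_pages;

	drm_sched_job_arm(&job->base);

	/* May now block interruptibly until enough enqueue credits are released. */
	return drm_sched_entity_push_job(&job->base);
}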

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index dc0e60d2c14b..c5f688362a34 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -580,11 +580,21 @@  void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
  * under common lock for the struct drm_sched_entity that was set up for
  * @sched_job in drm_sched_job_init().
  */
-void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
+int drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 {
 	struct drm_sched_entity *entity = sched_job->entity;
+	struct drm_gpu_scheduler *sched = sched_job->sched;
 	bool first;
 	ktime_t submit_ts;
+	int ret;
+
+	ret = wait_event_interruptible(
+			sched->job_scheduled,
+			atomic_read(&sched->enqueue_credit_count) <=
+			sched->enqueue_credit_limit);
+	if (ret)
+		return ret;
+	atomic_add(sched_job->enqueue_credits, &sched->enqueue_credit_count);
 
 	trace_drm_sched_job(sched_job, entity);
 	atomic_inc(entity->rq->sched->score);
@@ -609,7 +619,7 @@  void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 			spin_unlock(&entity->lock);
 
 			DRM_ERROR("Trying to push to a killed entity\n");
-			return;
+			return -EINVAL;
 		}
 
 		rq = entity->rq;
@@ -626,5 +636,7 @@  void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 
 		drm_sched_wakeup(sched);
 	}
+
+	return 0;
 }
 EXPORT_SYMBOL(drm_sched_entity_push_job);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 9412bffa8c74..1102cca69cb4 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1217,6 +1217,7 @@  static void drm_sched_run_job_work(struct work_struct *w)
 
 	trace_drm_run_job(sched_job, entity);
 	fence = sched->ops->run_job(sched_job);
+	atomic_sub(sched_job->enqueue_credits, &sched->enqueue_credit_count);
 	complete_all(&entity->entity_idle);
 	drm_sched_fence_scheduled(s_fence, fence);
 
@@ -1253,6 +1254,7 @@  int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 
 	sched->ops = args->ops;
 	sched->credit_limit = args->credit_limit;
+	sched->enqueue_credit_limit = args->enqueue_credit_limit;
 	sched->name = args->name;
 	sched->timeout = args->timeout;
 	sched->hang_limit = args->hang_limit;
@@ -1308,6 +1310,7 @@  int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->credit_count, 0);
+	atomic_set(&sched->enqueue_credit_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
 	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
 	INIT_WORK(&sched->work_free_job, drm_sched_free_job_work);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index da64232c989d..d830ffe083f1 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -329,6 +329,7 @@  struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * @s_fence: contains the fences for the scheduling of job.
  * @finish_cb: the callback for the finished fence.
  * @credits: the number of credits this job contributes to the scheduler
+ * @enqueue_credits: the number of enqueue credits this job contributes
  * @work: Helper to reschedule job kill to different context.
  * @id: a unique id assigned to each job scheduled on the scheduler.
  * @karma: increment on every hang caused by this job. If this exceeds the hang
@@ -366,6 +367,7 @@  struct drm_sched_job {
 
 	enum drm_sched_priority		s_priority;
 	u32				credits;
+	u32				enqueue_credits;
 	/** @last_dependency: tracks @dependencies as they signal */
 	unsigned int			last_dependency;
 	atomic_t			karma;
@@ -485,6 +487,10 @@  struct drm_sched_backend_ops {
  * @ops: backend operations provided by the driver.
  * @credit_limit: the credit limit of this scheduler
  * @credit_count: the current credit count of this scheduler
+ * @enqueue_credit_limit: the credit limit of jobs pushed to scheduler and not
+ *                        yet run
+ * @enqueue_credit_count: the current credit count of jobs pushed to scheduler
+ *                        but not yet run
  * @timeout: the time after which a job is removed from the scheduler.
  * @name: name of the ring for which this scheduler is being used.
  * @num_rqs: Number of run-queues. This is at most DRM_SCHED_PRIORITY_COUNT,
@@ -518,6 +524,8 @@  struct drm_gpu_scheduler {
 	const struct drm_sched_backend_ops	*ops;
 	u32				credit_limit;
 	atomic_t			credit_count;
+	u32				enqueue_credit_limit;
+	atomic_t			enqueue_credit_count;
 	long				timeout;
 	const char			*name;
 	u32                             num_rqs;
@@ -550,6 +558,8 @@  struct drm_gpu_scheduler {
  * @num_rqs: Number of run-queues. This may be at most DRM_SCHED_PRIORITY_COUNT,
  *	     as there's usually one run-queue per priority, but may be less.
  * @credit_limit: the number of credits this scheduler can hold from all jobs
+ * @enqueue_credit_limit: the number of credits that can be enqueued before
+ *                        drm_sched_entity_push_job() blocks
  * @hang_limit: number of times to allow a job to hang before dropping it.
  *		This mechanism is DEPRECATED. Set it to 0.
  * @timeout: timeout value in jiffies for submitted jobs.
@@ -564,6 +574,7 @@  struct drm_sched_init_args {
 	struct workqueue_struct *timeout_wq;
 	u32 num_rqs;
 	u32 credit_limit;
+	u32 enqueue_credit_limit;
 	unsigned int hang_limit;
 	long timeout;
 	atomic_t *score;
@@ -600,7 +611,7 @@  int drm_sched_job_init(struct drm_sched_job *job,
 		       struct drm_sched_entity *entity,
 		       u32 credits, void *owner);
 void drm_sched_job_arm(struct drm_sched_job *job);
-void drm_sched_entity_push_job(struct drm_sched_job *sched_job);
+int drm_sched_entity_push_job(struct drm_sched_job *sched_job);
 int drm_sched_job_add_dependency(struct drm_sched_job *job,
 				 struct dma_fence *fence);
 int drm_sched_job_add_syncobj_dependency(struct drm_sched_job *job,