
[v4,01/40] drm/gpuvm: Don't require obj lock in destructor path

Message ID 20250514175527.42488-2-robdclark@gmail.com
State Superseded
Series drm/msm: sparse / "VM_BIND" support

Commit Message

Rob Clark May 14, 2025, 5:53 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

See commit a414fe3a2129 ("drm/msm/gem: Drop obj lock in
msm_gem_free_object()") for justification.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/drm_gpuvm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Danilo Krummrich May 15, 2025, 8:54 a.m. UTC | #1
Hi Rob,

Can you please CC me on patches for GPUVM?

On Wed, May 14, 2025 at 10:53:15AM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> See commit a414fe3a2129 ("drm/msm/gem: Drop obj lock in
> msm_gem_free_object()") for justification.

Please write a proper commit message that explains the problem and the solution.
Please don't just refer to another commit and leave it to the reviewer of the
patch to figure this out.

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/drm_gpuvm.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)


> diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
> index f9eb56f24bef..1e89a98caad4 100644
> --- a/drivers/gpu/drm/drm_gpuvm.c
> +++ b/drivers/gpu/drm/drm_gpuvm.c
> @@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
>  	drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
>  	drm_gpuvm_bo_list_del(vm_bo, evict, lock);
>  
> -	drm_gem_gpuva_assert_lock_held(obj);
> +	if (kref_read(&obj->refcount) > 0)
> +		drm_gem_gpuva_assert_lock_held(obj);
> +
>  	list_del(&vm_bo->list.entry.gem);

This seems wrong.

A VM_BO object keeps a reference to the underlying GEM object, so this should
never happen.

This function calls drm_gem_object_put() before it returns.
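
For reference, the lifetime rule being described can be sketched with the
existing GPUVM API (a minimal sketch; error handling omitted):

    /* Creation: the vm_bo takes a reference on the GEM object. */
    vm_bo = drm_gpuvm_bo_obtain(gpuvm, obj);    /* gets a ref on obj */

    /*
     * Teardown: the final drm_gpuvm_bo_put() runs the kref release
     * function drm_gpuvm_bo_destroy(), which drops that reference via
     * drm_gem_object_put() only at the very end -- so obj->refcount is
     * expected to be > 0 on entry, and the kref_read() check added by
     * the patch should never observe zero under these rules.
     */
    drm_gpuvm_bo_put(vm_bo);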

>  
>  	if (ops && ops->vm_bo_free)
> @@ -1871,7 +1873,8 @@ drm_gpuva_unlink(struct drm_gpuva *va)
>  	if (unlikely(!obj))
>  		return;
>  
> -	drm_gem_gpuva_assert_lock_held(obj);
> +	if (kref_read(&obj->refcount) > 0)
> +		drm_gem_gpuva_assert_lock_held(obj);
>  	list_del_init(&va->gem.entry);
>  
>  	va->vm_bo = NULL;
> -- 
> 2.49.0
>
Danilo Krummrich May 15, 2025, 9:06 a.m. UTC | #2
On Thu, May 15, 2025 at 10:54:27AM +0200, Danilo Krummrich wrote:
> Hi Rob,
> 
> Can you please CC me on patches for GPUVM?
> 
> On Wed, May 14, 2025 at 10:53:15AM -0700, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> > 
> > See commit a414fe3a2129 ("drm/msm/gem: Drop obj lock in
> > msm_gem_free_object()") for justification.
> 
> Please write a proper commit message that explains the problem and the solution.
> Please don't just refer to another commit and leave it to the reviewer of the
> patch to figure this out.
> 
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/drm_gpuvm.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> 
> > diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
> > index f9eb56f24bef..1e89a98caad4 100644
> > --- a/drivers/gpu/drm/drm_gpuvm.c
> > +++ b/drivers/gpu/drm/drm_gpuvm.c
> > @@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
> >  	drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
> >  	drm_gpuvm_bo_list_del(vm_bo, evict, lock);
> >  
> > -	drm_gem_gpuva_assert_lock_held(obj);
> > +	if (kref_read(&obj->refcount) > 0)
> > +		drm_gem_gpuva_assert_lock_held(obj);
> > +
> >  	list_del(&vm_bo->list.entry.gem);
> 
> This seems wrong.
> 
> A VM_BO object keeps a reference to the underlying GEM object, so this should
> never happen.
> 
> This function calls drm_gem_object_put() before it returns.

I noticed your subsequent patch that allows VM_BO structures to have weak
references to GEM objects.

However, even with that this seems wrong. If the reference count of the GEM
object is zero when drm_gpuvm_bo_destroy() is called it means that the GEM
object is dead. However, until drm_gpuvm_bo_destroy() is called the GEM object
may still be on the extobj and evict lists, which means that other
code paths might fetch it from those lists and consider it to be a valid GEM
object.
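
Concretely, the window described above would look like this (a sketch,
assuming the weak-reference mode that the follow-up patch in this series
introduces):

    drm_gem_object_put(obj);    /* last reference: obj is now dead, but its
                                 * free path has not yet reached
                                 * drm_gpuvm_bo_destroy(), so its vm_bo can
                                 * still sit on the extobj/evict lists */

    /* meanwhile, on another thread: */
    drm_gpuvm_validate(gpuvm, exec);    /* walks the evict list and can
                                         * still fetch the vm_bo and its
                                         * dead GEM object */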
Rob Clark May 15, 2025, 5:35 p.m. UTC | #3
On Thu, May 15, 2025 at 2:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 10:54:27AM +0200, Danilo Krummrich wrote:
> > Hi Rob,
> >
> > Can you please CC me on patches for GPUVM?
> >
> > On Wed, May 14, 2025 at 10:53:15AM -0700, Rob Clark wrote:
> > > From: Rob Clark <robdclark@chromium.org>
> > >
> > > See commit a414fe3a2129 ("drm/msm/gem: Drop obj lock in
> > > msm_gem_free_object()") for justification.
> >
> > Please write a proper commit message that explains the problem and the solution.
> > Please don't just refer to another commit and leave it to the reviewer of the
> > patch to figure this out.
> >
> > > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > > ---
> > >  drivers/gpu/drm/drm_gpuvm.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> >
> > > diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
> > > index f9eb56f24bef..1e89a98caad4 100644
> > > --- a/drivers/gpu/drm/drm_gpuvm.c
> > > +++ b/drivers/gpu/drm/drm_gpuvm.c
> > > @@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
> > >     drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
> > >     drm_gpuvm_bo_list_del(vm_bo, evict, lock);
> > >
> > > -   drm_gem_gpuva_assert_lock_held(obj);
> > > +   if (kref_read(&obj->refcount) > 0)
> > > +           drm_gem_gpuva_assert_lock_held(obj);
> > > +
> > >     list_del(&vm_bo->list.entry.gem);
> >
> > This seems wrong.
> >
> > A VM_BO object keeps a reference to the underlying GEM object, so this should
> > never happen.
> >
> > This function calls drm_gem_object_put() before it returns.
>
> I noticed your subsequent patch that allows VM_BO structures to have weak
> references to GEM objects.
>
> However, even with that this seems wrong. If the reference count of the GEM
> object is zero when drm_gpuvm_bo_destroy() is called it means that the GEM
> object is dead. However, until drm_gpuvm_bo_destroy() is called the GEM object
> may still be on the extobj and evict lists, which means that other
> code paths might fetch it from those lists and consider it to be a valid GEM
> object.

We only iterate extobj or evicted in VM_BIND mode, where we aren't
using WEAK_REF.  I suppose some WARN_ON()s or BUG_ON()s could make
this more clear.
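
A sketch of what such a guard could look like at the top of the list
walkers, e.g. drm_gpuvm_validate() (DRM_GPUVM_VA_WEAK_REF is a placeholder
here for whatever flag the weak-reference mode ends up using; it is not
part of the current GPUVM API):

    /* DRM_GPUVM_VA_WEAK_REF is hypothetical: the extobj/evict lists are
     * only safe to walk while vm_bos hold strong GEM references. */
    if (WARN_ON(gpuvm->flags & DRM_GPUVM_VA_WEAK_REF))
        return -EINVAL;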

BR,
-R
Danilo Krummrich May 15, 2025, 5:55 p.m. UTC | #4
On Thu, May 15, 2025 at 10:35:21AM -0700, Rob Clark wrote:
> On Thu, May 15, 2025 at 2:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > On Thu, May 15, 2025 at 10:54:27AM +0200, Danilo Krummrich wrote:
> > > Hi Rob,
> > >
> > > Can you please CC me on patches for GPUVM?
> > >
> > > On Wed, May 14, 2025 at 10:53:15AM -0700, Rob Clark wrote:
> > > > From: Rob Clark <robdclark@chromium.org>
> > > >
> > > > See commit a414fe3a2129 ("drm/msm/gem: Drop obj lock in
> > > > msm_gem_free_object()") for justification.
> > >
> > > Please write a proper commit message that explains the problem and the solution.
> > > Please don't just refer to another commit and leave it to the reviewer of the
> > > patch to figure this out.
> > >
> > > > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > > > ---
> > > >  drivers/gpu/drm/drm_gpuvm.c | 7 +++++--
> > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > >
> > > > diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
> > > > index f9eb56f24bef..1e89a98caad4 100644
> > > > --- a/drivers/gpu/drm/drm_gpuvm.c
> > > > +++ b/drivers/gpu/drm/drm_gpuvm.c
> > > > @@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
> > > >     drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
> > > >     drm_gpuvm_bo_list_del(vm_bo, evict, lock);
> > > >
> > > > -   drm_gem_gpuva_assert_lock_held(obj);
> > > > +   if (kref_read(&obj->refcount) > 0)
> > > > +           drm_gem_gpuva_assert_lock_held(obj);
> > > > +
> > > >     list_del(&vm_bo->list.entry.gem);
> > >
> > > This seems wrong.
> > >
> > > A VM_BO object keeps a reference to the underlying GEM object, so this should
> > > never happen.
> > >
> > > This function calls drm_gem_object_put() before it returns.
> >
> > I noticed your subsequent patch that allows VM_BO structures to have weak
> > references to GEM objects.
> >
> > However, even with that this seems wrong. If the reference count of the GEM
> > object is zero when drm_gpuvm_bo_destroy() is called it means that the GEM
> > object is dead. However, until drm_gpuvm_bo_destroy() is called the GEM object
> > may still be on the extobj and evict lists, which means that other
> > code paths might fetch it from those lists and consider it to be a valid GEM
> > object.
> 
> We only iterate extobj or evicted in VM_BIND mode, where we aren't
> using WEAK_REF.  I suppose some WARN_ON()s or BUG_ON()s could make
> this more clear.

There is also the GEM object's list of VM_BOs; are you using that?

Anyways, I don't agree with that. Even if you can tweak your driver to not run
into trouble with this, we can't introduce a mode that violates GPUVM's internal
lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().

I still don't see a real technical reason why msm can't be reworked to follow
those lifetime rules.
Rob Clark May 15, 2025, 9:57 p.m. UTC | #5
On Thu, May 15, 2025 at 10:55 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 10:35:21AM -0700, Rob Clark wrote:
> > On Thu, May 15, 2025 at 2:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 10:54:27AM +0200, Danilo Krummrich wrote:
> > > > Hi Rob,
> > > >
> > > > Can you please CC me on patches for GPUVM?
> > > >
> > > > On Wed, May 14, 2025 at 10:53:15AM -0700, Rob Clark wrote:
> > > > > From: Rob Clark <robdclark@chromium.org>
> > > > >
> > > > > See commit a414fe3a2129 ("drm/msm/gem: Drop obj lock in
> > > > > msm_gem_free_object()") for justification.
> > > >
> > > > Please write a proper commit message that explains the problem and the solution.
> > > > Please don't just refer to another commit and leave it to the reviewer of the
> > > > patch to figure this out.
> > > >
> > > > > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > > > > ---
> > > > >  drivers/gpu/drm/drm_gpuvm.c | 7 +++++--
> > > > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > >
> > > >
> > > > > diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
> > > > > index f9eb56f24bef..1e89a98caad4 100644
> > > > > --- a/drivers/gpu/drm/drm_gpuvm.c
> > > > > +++ b/drivers/gpu/drm/drm_gpuvm.c
> > > > > @@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
> > > > >     drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
> > > > >     drm_gpuvm_bo_list_del(vm_bo, evict, lock);
> > > > >
> > > > > -   drm_gem_gpuva_assert_lock_held(obj);
> > > > > +   if (kref_read(&obj->refcount) > 0)
> > > > > +           drm_gem_gpuva_assert_lock_held(obj);
> > > > > +
> > > > >     list_del(&vm_bo->list.entry.gem);
> > > >
> > > > This seems wrong.
> > > >
> > > > A VM_BO object keeps a reference to the underlying GEM object, so this should
> > > > never happen.
> > > >
> > > > This function calls drm_gem_object_put() before it returns.
> > >
> > > I noticed your subsequent patch that allows VM_BO structures to have weak
> > > references to GEM objects.
> > >
> > > However, even with that this seems wrong. If the reference count of the GEM
> > > object is zero when drm_gpuvm_bo_destroy() is called it means that the GEM
> > > object is dead. However, until drm_gpuvm_bo_destroy() is called the GEM object
> > > may still be on the extobj and evict lists, which means that other
> > > code paths might fetch it from those lists and consider it to be a valid GEM
> > > object.
> >
> > We only iterate extobj or evicted in VM_BIND mode, where we aren't
> > using WEAK_REF.  I suppose some WARN_ON()s or BUG_ON()s could make
> > this more clear.
>
> There is also the GEM object's list of VM_BOs; are you using that?

Yes, but at this point there are no more refs to the obj, and that
list is obj-specific.

> Anyways, I don't agree with that. Even if you can tweak your driver to not run
> into trouble with this, we can't introduce a mode that violates GPUVM's internal
> lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().
>
> I still don't see a real technical reason why msm can't be reworked to follow
> those lifetime rules.

The basic issue is that it would be really awkward to have two
side-by-side VM/VMA management/tracking systems.  But in legacy mode,
we have the opposite direction of reference holding.  (And at the same
time, we don't need/use most of the features of gpuvm.)
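
The two reference directions being contrasted, spelled out (a sketch; the
msm helper names are from the current driver, not from this series):

    /* GPUVM rule: the VM side owns the reference. */
    vm_bo->obj = obj;
    drm_gem_object_get(obj);            /* vm_bo keeps obj alive */

    /* msm legacy mode: the BO side owns its mappings; VMAs are torn
     * down from the object's destructor, which holds no obj reference:
     *
     *   msm_gem_free_object(obj)
     *     -> put_iova_spaces(obj, ...)   // detach and free the VMAs
     */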

BR,
-R
Rob Clark May 16, 2025, 4:20 p.m. UTC | #6
On Fri, May 16, 2025 at 2:01 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 02:57:46PM -0700, Rob Clark wrote:
> > On Thu, May 15, 2025 at 10:55 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > Anyways, I don't agree with that. Even if you can tweak your driver to not run
> > > into trouble with this, we can't introduce a mode that violates GPUVM's internal
> > > lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().
> > >
> > > I still don't see a real technical reason why msm can't be reworked to follow
> > > those lifetime rules.
> >
> > The basic issue is that it would be really awkward to have two
> > side-by-side VM/VMA management/tracking systems.  But in legacy mode,
> > we have the opposite direction of reference holding.  (And at the same
> > time, we don't need/use most of the features of gpuvm.)
>
> Ok, let's try to move this forward; I see three options (in order of descending
> preference):
>
>   1) Rework the legacy code to properly work with GPUVM.
>   2) Don't use GPUVM for the legacy mode.
>   .
>   .
>   .
>   3) Get an ACK from Dave / Sima to implement those workarounds for MSM in
>      GPUVM.
>
> If you go for 3), the code introduced by those two patches should be guarded
> with a flag that makes it very clear that this is a workaround specifically
> for MSM legacy mode and does not give any guarantees in terms of correctness
> regarding lifetimes etc., e.g. DRM_GPUVM_MSM_LEGACY_QUIRK.

I'm not even sure how #2 would work, other than just copy/pasta all of
drm_gpuvm into msm, which doesn't really seem great.

As for #1, even if I could get it to work, it would still be a lot
more mmu map/unmap (like on every pageflip, vs the current state that
the vma is kept around until the object is freed).  For the
non-VM_BIND world, there are advantages to the BO holding the ref to
the VMA, rather than the other way around.  Even at just a modest
single layer 1080p the map takes ~.2ms and unmap ~.3ms (plus the unmap
costs a tlbinv).  So from that standpoint, #3 is the superior option.

BR,
-R
Dave Airlie May 20, 2025, 9:25 p.m. UTC | #7
On Sat, 17 May 2025 at 02:20, Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, May 16, 2025 at 2:01 AM Danilo Krummrich <dakr@kernel.org> wrote:
> >
> > On Thu, May 15, 2025 at 02:57:46PM -0700, Rob Clark wrote:
> > > On Thu, May 15, 2025 at 10:55 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > Anyways, I don't agree with that. Even if you can tweak your driver to not run
> > > > into trouble with this, we can't introduce a mode that violates GPUVM's internal
> > > > lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().
> > > >
> > > > I still don't see a real technical reason why msm can't be reworked to follow
> > > > those lifetime rules.
> > >
> > > The basic issue is that it would be really awkward to have two
> > > side-by-side VM/VMA management/tracking systems.  But in legacy mode,
> > > we have the opposite direction of reference holding.  (And at the same
> > > time, we don't need/use most of the features of gpuvm.)
> >
> > Ok, let's try to move this forward; I see three options (in order of descending
> > preference):
> >
> >   1) Rework the legacy code to properly work with GPUVM.
> >   2) Don't use GPUVM for the legacy mode.
> >   .
> >   .
> >   .
> >   3) Get an ACK from Dave / Sima to implement those workarounds for MSM in
> >      GPUVM.
> >
> > If you go for 3), the code introduced by those two patches should be guarded
> > with a flag that makes it very clear that this is a workaround specifically
> > for MSM legacy mode and does not give any guarantees in terms of correctness
> > regarding lifetimes etc., e.g. DRM_GPUVM_MSM_LEGACY_QUIRK.
>
> I'm not even sure how #2 would work, other than just copy/pasta all of
> drm_gpuvm into msm, which doesn't really seem great.
>
> As for #1, even if I could get it to work, it would still be a lot
> > more mmu map/unmap (like on every pageflip, vs the current state where
> the vma is kept around until the object is freed).  For the
> non-VM_BIND world, there are advantages to the BO holding the ref to
> the VMA, rather than the other way around.  Even at just a modest
> single layer 1080p the map takes ~.2ms and unmap ~.3ms (plus the unmap
> costs a tlbinv).  So from that standpoint, #3 is the superior option.
>

Before we get to #3, I'll need a bit more info here on why you have to
map/unmap the VMA on every pageflip.

But actually I think 2 is the best option; I think in nouveau this is
where we ended up: we didn't modify the old submission paths at all
and kept the old bo/vm lifetimes.

We just added completely new bind/exec ioctls and you can only use one
method once you've opened an fd.
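
That per-fd latching could look something like this on the msm side (a
sketch with hypothetical names; msm_ioctl_vm_bind and ctx->vm_mode do not
exist today, and nouveau's actual gate is its own):

    static int msm_ioctl_vm_bind(struct drm_device *dev, void *data,
                                 struct drm_file *file)
    {
        struct msm_file_private *ctx = file->driver_priv;

        /* Hypothetical: the first ioctl on the fd decides the mode;
         * mixing the legacy submit path and the VM_BIND path on one
         * fd is rejected. */
        if (cmpxchg(&ctx->vm_mode, VM_MODE_UNSET, VM_MODE_BIND) ==
            VM_MODE_LEGACY)
            return -EINVAL;

        /* ... the actual bind operation ... */
        return 0;
    }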

Dave.
Rob Clark May 20, 2025, 9:52 p.m. UTC | #8
On Tue, May 20, 2025 at 2:25 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Sat, 17 May 2025 at 02:20, Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Fri, May 16, 2025 at 2:01 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > >
> > > On Thu, May 15, 2025 at 02:57:46PM -0700, Rob Clark wrote:
> > > > On Thu, May 15, 2025 at 10:55 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > Anyways, I don't agree with that. Even if you can tweak your driver to not run
> > > > > into trouble with this, we can't introduce a mode that violates GPUVM's internal
> > > > > lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().
> > > > >
> > > > > I still don't see a real technical reason why msm can't be reworked to follow
> > > > > those lifetime rules.
> > > >
> > > > The basic issue is that it would be really awkward to have two
> > > > side-by-side VM/VMA management/tracking systems.  But in legacy mode,
> > > > we have the opposite direction of reference holding.  (And at the same
> > > > time, we don't need/use most of the features of gpuvm.)
> > >
> > > Ok, let's try to move this forward; I see three options (in order of descending
> > > preference):
> > >
> > >   1) Rework the legacy code to properly work with GPUVM.
> > >   2) Don't use GPUVM for the legacy mode.
> > >   .
> > >   .
> > >   .
> > >   3) Get an ACK from Dave / Sima to implement those workarounds for MSM in
> > >      GPUVM.
> > >
> > > If you go for 3), the code introduced by those two patches should be guarded
> > > with a flag that makes it very clear that this is a workaround specifically
> > > for MSM legacy mode and does not give any guarantees in terms of correctness
> > > regarding lifetimes etc., e.g. DRM_GPUVM_MSM_LEGACY_QUIRK.
> >
> > I'm not even sure how #2 would work, other than just copy/pasta all of
> > drm_gpuvm into msm, which doesn't really seem great.
> >
> > As for #1, even if I could get it to work, it would still be a lot
> > > more mmu map/unmap (like on every pageflip, vs the current state where
> > the vma is kept around until the object is freed).  For the
> > non-VM_BIND world, there are advantages to the BO holding the ref to
> > the VMA, rather than the other way around.  Even at just a modest
> > single layer 1080p the map takes ~.2ms and unmap ~.3ms (plus the unmap
> > costs a tlbinv).  So from that standpoint, #3 is the superior option.
> >
>
> Before we get to #3, I'll need a bit more info here on why you have to
> map/unmap the VMA on every pageflip.

Previously we'd keep the VMA hanging around until the GEM obj is
freed.  But that can't work if the VMA (via the VM_BO) is holding a
reference to the GEM obj.

I was kinda thinking about keeping the VMA around until the handle is
closed.. but that doesn't cover the dma-buf case (ie. when you
re-import the dma-buf fd each frame.. I know android does this, unsure
about other wsi's).
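
Spelled out, the per-frame import pattern looks like this on the
compositor side (a sketch; receive_dmabuf_from_client() is a placeholder
for the wsi's transport):

    /* every frame: */
    fd = receive_dmabuf_from_client();          /* placeholder helper */
    drmPrimeFDToHandle(drm_fd, fd, &handle);    /* resolves to the same
                                                 * GEM object each time */
    /* ... scanout / composition using handle ... */
    drmCloseBufferHandle(drm_fd, handle);       /* if VMAs died with the
                                                 * handle, this would force
                                                 * an unmap + tlbinv here
                                                 * and a fresh map next
                                                 * frame */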

> But actually I think 2 is the best option; I think in nouveau this is
> where we ended up: we didn't modify the old submission paths at all
> and kept the old bo/vm lifetimes.
>
> We just added completely new bind/exec ioctls and you can only use one
> method once you've opened an fd.

hmm, but that means tracking VMAs against a single BO differently..
which.. at least seems ugly..

BR,
-R
Dave Airlie May 20, 2025, 10:31 p.m. UTC | #9
On Wed, 21 May 2025 at 07:53, Rob Clark <robdclark@gmail.com> wrote:
>
> On Tue, May 20, 2025 at 2:25 PM Dave Airlie <airlied@gmail.com> wrote:
> >
> > On Sat, 17 May 2025 at 02:20, Rob Clark <robdclark@gmail.com> wrote:
> > >
> > > On Fri, May 16, 2025 at 2:01 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > >
> > > > On Thu, May 15, 2025 at 02:57:46PM -0700, Rob Clark wrote:
> > > > > On Thu, May 15, 2025 at 10:55 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > > Anyways, I don't agree with that. Even if you can tweak your driver to not run
> > > > > > into trouble with this, we can't introduce a mode that violates GPUVM's internal
> > > > > > lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().
> > > > > >
> > > > > > I still don't see a real technical reason why msm can't be reworked to follow
> > > > > > those lifetime rules.
> > > > >
> > > > > The basic issue is that it would be really awkward to have two
> > > > > side-by-side VM/VMA management/tracking systems.  But in legacy mode,
> > > > > we have the opposite direction of reference holding.  (And at the same
> > > > > time, we don't need/use most of the features of gpuvm.)
> > > >
> > > > Ok, let's try to move this forward; I see three options (in order of descending
> > > > preference):
> > > >
> > > >   1) Rework the legacy code to properly work with GPUVM.
> > > >   2) Don't use GPUVM for the legacy mode.
> > > >   .
> > > >   .
> > > >   .
> > > >   3) Get an ACK from Dave / Sima to implement those workarounds for MSM in
> > > >      GPUVM.
> > > >
> > > > If you go for 3), the code introduced by those two patches should be guarded
> > > > with a flag that makes it very clear that this is a workaround specifically
> > > > for MSM legacy mode and does not give any guarantees in terms of correctness
> > > > regarding lifetimes etc., e.g. DRM_GPUVM_MSM_LEGACY_QUIRK.
> > >
> > > I'm not even sure how #2 would work, other than just copy/pasta all of
> > > drm_gpuvm into msm, which doesn't really seem great.
> > >
> > > As for #1, even if I could get it to work, it would still be a lot
> > > > more mmu map/unmap (like on every pageflip, vs the current state where
> > > the vma is kept around until the object is freed).  For the
> > > non-VM_BIND world, there are advantages to the BO holding the ref to
> > > the VMA, rather than the other way around.  Even at just a modest
> > > single layer 1080p the map takes ~.2ms and unmap ~.3ms (plus the unmap
> > > costs a tlbinv).  So from that standpoint, #3 is the superior option.
> > >
> >
> > Before we get to #3, I'll need a bit more info here on why you have to
> > map/unmap the VMA on every pageflip.
>
> Previously we'd keep the VMA hanging around until the GEM obj is
> freed.  But that can't work if the VMA (via the VM_BO) is holding a
> reference to the GEM obj.
>
> I was kinda thinking about keeping the VMA around until the handle is
> closed.. but that doesn't cover the dma-buf case (ie. when you
> re-import the dma-buf fd each frame.. I know android does this, unsure
> about other wsi's).
>
> > But actually I think 2 is the best option; I think in nouveau this is
> > where we ended up: we didn't modify the old submission paths at all
> > and kept the old bo/vm lifetimes.
> >
> > We just added completely new bind/exec ioctls and you can only use one
> > method once you've opened an fd.
>
> hmm, but that means tracking VMAs against a single BO differently..
> which.. at least seems ugly..

I don't think it is if you already have the code to do that, and just
add gpuvm support in parallel.

You also have to factor in that the world is moving towards Vulkan
for everything, so any optimisations you've made for particular legacy
paths will need to be dealt with in that future picture anyway.

But I'd rather not hack gpuvm into being something it isn't.  If there
is meaningful commonality in legacy bo/vm bindings across drivers,
we could create something new, but the ref counting and handling is
pretty fundamental to gpuvm's architecture.

There should only be two paths, legacy and gpuvm, and you shouldn't
ever be mixing them on a particular exec path, since you should only
have a vm per userspace fd, and can pick which way to use it the first
time someone calls it.

Dave.
Rob Clark May 20, 2025, 10:56 p.m. UTC | #10
On Tue, May 20, 2025 at 3:31 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Wed, 21 May 2025 at 07:53, Rob Clark <robdclark@gmail.com> wrote:
> >
> > On Tue, May 20, 2025 at 2:25 PM Dave Airlie <airlied@gmail.com> wrote:
> > >
> > > On Sat, 17 May 2025 at 02:20, Rob Clark <robdclark@gmail.com> wrote:
> > > >
> > > > On Fri, May 16, 2025 at 2:01 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > >
> > > > > On Thu, May 15, 2025 at 02:57:46PM -0700, Rob Clark wrote:
> > > > > > On Thu, May 15, 2025 at 10:55 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > > > > > Anyways, I don't agree with that. Even if you can tweak your driver to not run
> > > > > > > into trouble with this, we can't introduce a mode that violates GPUVM's internal
> > > > > > > lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().
> > > > > > >
> > > > > > > I still don't see a real technical reason why msm can't be reworked to follow
> > > > > > > those lifetime rules.
> > > > > >
> > > > > > The basic issue is that it would be really awkward to have two
> > > > > > side-by-side VM/VMA management/tracking systems.  But in legacy mode,
> > > > > > we have the opposite direction of reference holding.  (And at the same
> > > > > > time, we don't need/use most of the features of gpuvm.)
> > > > >
> > > > > Ok, let's try to move this forward; I see three options (in order of descending
> > > > > preference):
> > > > >
> > > > >   1) Rework the legacy code to properly work with GPUVM.
> > > > >   2) Don't use GPUVM for the legacy mode.
> > > > >   .
> > > > >   .
> > > > >   .
> > > > >   3) Get an ACK from Dave / Sima to implement those workarounds for MSM in
> > > > >      GPUVM.
> > > > >
> > > > > If you go for 3), the code introduced by those two patches should be guarded
> > > > > with a flag that makes it very clear that this is a workaround specifically
> > > > > for MSM legacy mode and does not give any guarantees in terms of correctness
> > > > > regarding lifetimes etc., e.g. DRM_GPUVM_MSM_LEGACY_QUIRK.
> > > >
> > > > I'm not even sure how #2 would work, other than just copy/pasta all of
> > > > drm_gpuvm into msm, which doesn't really seem great.
> > > >
> > > > As for #1, even if I could get it to work, it would still be a lot
> > > > > more mmu map/unmap (like on every pageflip, vs the current state where
> > > > the vma is kept around until the object is freed).  For the
> > > > non-VM_BIND world, there are advantages to the BO holding the ref to
> > > > the VMA, rather than the other way around.  Even at just a modest
> > > > single layer 1080p the map takes ~.2ms and unmap ~.3ms (plus the unmap
> > > > costs a tlbinv).  So from that standpoint, #3 is the superior option.
> > > >
> > >
> > > Before we get to #3, I'll need a bit more info here on why you have to
> > > map/unmap the VMA on every pageflip.
> >
> > Previously we'd keep the VMA hanging around until the GEM obj is
> > freed.  But that can't work if the VMA (via the VM_BO) is holding a
> > reference to the GEM obj.
> >
> > I was kinda thinking about keeping the VMA around until the handle is
> > closed.. but that doesn't cover the dma-buf case (ie. when you
> > re-import the dma-buf fd each frame.. I know android does this, unsure
> > about other wsi's).
> >
> > > But actually I think 2 is the best option; I think in nouveau this is
> > > where we ended up: we didn't modify the old submission paths at all
> > > and kept the old bo/vm lifetimes.
> > >
> > > We just added completely new bind/exec ioctls and you can only use one
> > > method once you've opened an fd.
> >
> > hmm, but that means tracking VMAs against a single BO differently..
> > which.. at least seems ugly..
>
> I don't think it is if you already have the code to do that, and just
> add gpuvm support in parallel.
>
> You also have to factor in that the world is moving towards Vulkan
> for everything, so any optimisations you've made for particular legacy
> paths will need to be dealt with in that future picture anyway.

fwiw, the case I'm more worried about is the kms vm for scanout, which
won't be using vk.

BR,
-R

> But I'd rather not hack gpuvm into being something it isn't.  If there
> is meaningful commonality in legacy bo/vm bindings across drivers,
> we could create something new, but the ref counting and handling is
> pretty fundamental to gpuvm's architecture.
>
> There should only be two paths, legacy and gpuvm, and you shouldn't
> ever be mixing them on a particular exec path, since you should only
> have a vm per userspace fd, and can pick which way to use it the first
> time someone calls it.
>
> Dave.

Patch

diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
index f9eb56f24bef..1e89a98caad4 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
 	drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
 	drm_gpuvm_bo_list_del(vm_bo, evict, lock);
 
-	drm_gem_gpuva_assert_lock_held(obj);
+	if (kref_read(&obj->refcount) > 0)
+		drm_gem_gpuva_assert_lock_held(obj);
+
 	list_del(&vm_bo->list.entry.gem);
 
 	if (ops && ops->vm_bo_free)
@@ -1871,7 +1873,8 @@ drm_gpuva_unlink(struct drm_gpuva *va)
 	if (unlikely(!obj))
 		return;
 
-	drm_gem_gpuva_assert_lock_held(obj);
+	if (kref_read(&obj->refcount) > 0)
+		drm_gem_gpuva_assert_lock_held(obj);
 	list_del_init(&va->gem.entry);
 
 	va->vm_bo = NULL;