| Message ID | 20250514175527.42488-2-robdclark@gmail.com |
|---|---|
| State | Superseded |
| Series | drm/msm: sparse / "VM_BIND" support |
Hi Rob,

Can you please CC me on patches for GPUVM?

On Wed, May 14, 2025 at 10:53:15AM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> See commit a414fe3a2129 ("drm/msm/gem: Drop obj lock in
> msm_gem_free_object()") for justification.

Please write a proper commit message that explains the problem and the
solution. Please don't just refer to another commit and leave it to the
reviewer of the patch to figure this out.

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/drm_gpuvm.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
> index f9eb56f24bef..1e89a98caad4 100644
> --- a/drivers/gpu/drm/drm_gpuvm.c
> +++ b/drivers/gpu/drm/drm_gpuvm.c
> @@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
>  	drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
>  	drm_gpuvm_bo_list_del(vm_bo, evict, lock);
>
> -	drm_gem_gpuva_assert_lock_held(obj);
> +	if (kref_read(&obj->refcount) > 0)
> +		drm_gem_gpuva_assert_lock_held(obj);
> +
>  	list_del(&vm_bo->list.entry.gem);

This seems wrong.

A VM_BO object keeps a reference to the underlying GEM object, so this
should never happen.

This function calls drm_gem_object_put() before it returns.

>
>  	if (ops && ops->vm_bo_free)
> @@ -1871,7 +1873,8 @@ drm_gpuva_unlink(struct drm_gpuva *va)
>  	if (unlikely(!obj))
>  		return;
>
> -	drm_gem_gpuva_assert_lock_held(obj);
> +	if (kref_read(&obj->refcount) > 0)
> +		drm_gem_gpuva_assert_lock_held(obj);
>  	list_del_init(&va->gem.entry);
>
>  	va->vm_bo = NULL;
> --
> 2.49.0
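For context, the lifetime rule Danilo is pointing at: a GPUVM vm_bo is expected to hold a strong reference to its GEM object for its whole lifetime, so the object's refcount cannot be zero while drm_gpuvm_bo_destroy() runs. Below is a minimal sketch of that pattern, with locking, list setup, and error handling omitted; names prefixed `sketch_` are illustrative stand-ins, not the actual drm_gpuvm.c code.

```c
/*
 * Minimal sketch of the reference pattern GPUVM assumes: the vm_bo pins
 * the GEM object at creation and drops that pin only as the very last
 * step of destruction, so obj->refcount > 0 whenever a vm_bo exists.
 */
struct sketch_vm_bo {
	struct kref kref;
	struct drm_gem_object *obj;
};

static struct sketch_vm_bo *sketch_vm_bo_create(struct drm_gem_object *obj)
{
	struct sketch_vm_bo *vm_bo = kzalloc(sizeof(*vm_bo), GFP_KERNEL);

	if (!vm_bo)
		return NULL;

	kref_init(&vm_bo->kref);
	vm_bo->obj = obj;
	drm_gem_object_get(obj);	/* strong reference for the vm_bo's lifetime */

	return vm_bo;
}

static void sketch_vm_bo_destroy(struct kref *kref)
{
	struct sketch_vm_bo *vm_bo = container_of(kref, struct sketch_vm_bo, kref);
	struct drm_gem_object *obj = vm_bo->obj;

	kfree(vm_bo);
	drm_gem_object_put(obj);	/* the put Danilo refers to, after teardown */
}
```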
On Thu, May 15, 2025 at 10:54:27AM +0200, Danilo Krummrich wrote:
> [...]
> On Wed, May 14, 2025 at 10:53:15AM -0700, Rob Clark wrote:
> > [...]
> > -	drm_gem_gpuva_assert_lock_held(obj);
> > +	if (kref_read(&obj->refcount) > 0)
> > +		drm_gem_gpuva_assert_lock_held(obj);
> > +
> >  	list_del(&vm_bo->list.entry.gem);
>
> This seems wrong.
>
> A VM_BO object keeps a reference to the underlying GEM object, so this
> should never happen.
>
> This function calls drm_gem_object_put() before it returns.

I noticed your subsequent patch that allows VM_BO structures to have weak
references to GEM objects.

However, even with that this seems wrong. If the reference count of the GEM
object is zero when drm_gpuvm_bo_destroy() is called, it means that the GEM
object is dead. However, until drm_gpuvm_bo_destroy() is called the GEM
object potentially remains on the extobj and evicted lists, which means that
other code paths might fetch it from those lists and consider it to be a
valid GEM object.
On Thu, May 15, 2025 at 2:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 10:54:27AM +0200, Danilo Krummrich wrote:
> > [...]
> > This seems wrong.
> >
> > A VM_BO object keeps a reference to the underlying GEM object, so this
> > should never happen.
> >
> > This function calls drm_gem_object_put() before it returns.
>
> I noticed your subsequent patch that allows VM_BO structures to have weak
> references to GEM objects.
>
> However, even with that this seems wrong. If the reference count of the GEM
> object is zero when drm_gpuvm_bo_destroy() is called, it means that the GEM
> object is dead. However, until drm_gpuvm_bo_destroy() is called the GEM
> object potentially remains on the extobj and evicted lists, which means
> that other code paths might fetch it from those lists and consider it to
> be a valid GEM object.

We only iterate extobj or evicted in VM_BIND mode, where we aren't using
WEAK_REF. I suppose some WARN_ON()s or BUG_ON()s could make this more clear.

BR,
-R
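To illustrate Rob's suggestion, a guard of the kind he describes might look like the following sketch. `DRM_GPUVM_WEAK_REF` stands in for the flag proposed elsewhere in this series (it is not an existing upstream GPUVM flag), and the `extobj.list`/`evict.list` fields follow struct drm_gpuvm's list pairs.

```c
/*
 * Sketch of the kind of WARN_ON() Rob suggests: in weak-reference mode
 * the extobj/evict lists must never be populated, since only VM_BIND
 * mode iterates them.
 */
static void sketch_assert_no_extobj_tracking(struct drm_gpuvm *gpuvm)
{
	/* DRM_GPUVM_WEAK_REF is the flag proposed in this series. */
	if (!(gpuvm->flags & DRM_GPUVM_WEAK_REF))
		return;

	WARN_ON(!list_empty(&gpuvm->extobj.list));
	WARN_ON(!list_empty(&gpuvm->evict.list));
}
```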
On Thu, May 15, 2025 at 10:35:21AM -0700, Rob Clark wrote:
> On Thu, May 15, 2025 at 2:06 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > [...]
> > However, even with that this seems wrong. If the reference count of the
> > GEM object is zero when drm_gpuvm_bo_destroy() is called, it means that
> > the GEM object is dead. However, until drm_gpuvm_bo_destroy() is called
> > the GEM object potentially remains on the extobj and evicted lists,
> > which means that other code paths might fetch it from those lists and
> > consider it to be a valid GEM object.
>
> We only iterate extobj or evicted in VM_BIND mode, where we aren't using
> WEAK_REF. I suppose some WARN_ON()s or BUG_ON()s could make this more
> clear.

There is also the GEM object's list of VM_BOs, are you using that?

Anyways, I don't agree with that. Even if you can tweak your driver to not
run into trouble with this, we can't introduce a mode that violates GPUVM's
internal lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().

I still don't see a real technical reason why msm can't be reworked to
follow those lifetime rules.
On Thu, May 15, 2025 at 10:55 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 10:35:21AM -0700, Rob Clark wrote:
> > [...]
> > We only iterate extobj or evicted in VM_BIND mode, where we aren't using
> > WEAK_REF. I suppose some WARN_ON()s or BUG_ON()s could make this more
> > clear.
>
> There is also the GEM object's list of VM_BOs, are you using that?

yes, but at this point there are no more refs to the obj, and that list is
obj specific

> Anyways, I don't agree with that. Even if you can tweak your driver to not
> run into trouble with this, we can't introduce a mode that violates GPUVM's
> internal lifetimes and subsequently fix it up with WARN_ON() or BUG_ON().
>
> I still don't see a real technical reason why msm can't be reworked to
> follow those lifetime rules.

The basic issue is that it would be really awkward to have two side-by-side
VM/VMA management/tracking systems. And in legacy mode, we have the opposite
direction of reference holding. (But at the same time, we don't need/use
most of the features of gpuvm.)

BR,
-R
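The "opposite direction" Rob describes, sketched for contrast with the GPUVM rule above: in the legacy msm model the BO owns its VMAs outright and tears them down at object free. All struct and helper names here are illustrative, not actual msm code.

```c
/*
 * Sketch of the legacy direction of ownership: the GEM object owns a
 * list of VMAs and frees them at object teardown.  No VMA holds a
 * reference back to the object, so the object's refcount can hit zero
 * while VMAs still exist -- exactly the case the patch must tolerate.
 */
struct sketch_vma {
	struct list_head node;
	u64 iova;
};

struct sketch_legacy_bo {
	struct drm_gem_object base;
	struct list_head vmas;		/* VMAs owned by this BO */
};

static void sketch_legacy_free(struct sketch_legacy_bo *bo)
{
	struct sketch_vma *vma, *tmp;

	list_for_each_entry_safe(vma, tmp, &bo->vmas, node) {
		sketch_iommu_unmap(vma);	/* illustrative unmap helper */
		list_del(&vma->node);
		kfree(vma);
	}
}
```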
On Fri, May 16, 2025 at 2:01 AM Danilo Krummrich <dakr@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 02:57:46PM -0700, Rob Clark wrote:
> > On Thu, May 15, 2025 at 10:55 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > > Anyways, I don't agree with that. Even if you can tweak your driver to
> > > not run into trouble with this, we can't introduce a mode that violates
> > > GPUVM's internal lifetimes and subsequently fix it up with WARN_ON() or
> > > BUG_ON().
> > >
> > > I still don't see a real technical reason why msm can't be reworked to
> > > follow those lifetime rules.
> >
> > The basic issue is that it would be really awkward to have two
> > side-by-side VM/VMA management/tracking systems. And in legacy mode, we
> > have the opposite direction of reference holding. (But at the same time,
> > we don't need/use most of the features of gpuvm.)
>
> Ok, let's try to move this forward; I see three options (in order of
> descending preference):
>
>   1) Rework the legacy code to properly work with GPUVM.
>   2) Don't use GPUVM for the legacy mode.
>   .
>   .
>   .
>   3) Get an ACK from Dave / Sima to implement those workarounds for MSM in
>      GPUVM.
>
> If you go for 3), the code introduced by those two patches should be
> guarded with a flag that makes it very clear that this is a workaround
> specifically for MSM legacy mode and does not give any guarantees in terms
> of correctness regarding lifetimes etc., e.g. DRM_GPUVM_MSM_LEGACY_QUIRK.

I'm not even sure how #2 would work, other than just copy/pasta all of
drm_gpuvm into msm, which doesn't really seem great.

As for #1, even if I could get it to work, it would still be a lot more mmu
map/unmap (like on every pageflip, vs the current state that the vma is kept
around until the object is freed). For the non-VM_BIND world, there are
advantages to the BO holding the ref to the VMA, rather than the other way
around. Even at just a modest single layer 1080p the map takes ~.2ms and
unmap ~.3ms (plus the unmap costs a tlbinv). So from that standpoint, #3 is
the superior option.

BR,
-R
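To put Rob's numbers in frame-budget terms: ~0.2 ms map plus ~0.3 ms unmap is roughly 0.5 ms of extra work per pageflip, about 3% of a 16.7 ms frame at 60 Hz and about 6% of an 8.3 ms frame at 120 Hz, before counting the cost of the TLB invalidate itself.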
On Sat, 17 May 2025 at 02:20, Rob Clark <robdclark@gmail.com> wrote:
>
> On Fri, May 16, 2025 at 2:01 AM Danilo Krummrich <dakr@kernel.org> wrote:
> > [...]
> > Ok, let's try to move this forward; I see three options (in order of
> > descending preference):
> >
> >   1) Rework the legacy code to properly work with GPUVM.
> >   2) Don't use GPUVM for the legacy mode.
> >   3) Get an ACK from Dave / Sima to implement those workarounds for MSM
> >      in GPUVM.
> >
> > If you go for 3), the code introduced by those two patches should be
> > guarded with a flag that makes it very clear that this is a workaround
> > specifically for MSM legacy mode and does not give any guarantees in
> > terms of correctness regarding lifetimes etc., e.g.
> > DRM_GPUVM_MSM_LEGACY_QUIRK.
>
> I'm not even sure how #2 would work, other than just copy/pasta all of
> drm_gpuvm into msm, which doesn't really seem great.
>
> As for #1, even if I could get it to work, it would still be a lot more
> mmu map/unmap (like on every pageflip, vs the current state that the vma
> is kept around until the object is freed). For the non-VM_BIND world,
> there are advantages to the BO holding the ref to the VMA, rather than the
> other way around. Even at just a modest single layer 1080p the map takes
> ~.2ms and unmap ~.3ms (plus the unmap costs a tlbinv). So from that
> standpoint, #3 is the superior option.

Before we get to #3, I'll need a bit more info here on why you have to
map/unmap the VMA on every pageflip.

But actually I think 2 is the best option; I think in nouveau this is where
we ended up: we didn't modify the old submission paths at all and kept the
old bo/vm lifetimes.

We just added completely new bind/exec ioctls, and you can only use one
method once you've opened an fd.

Dave.
On Tue, May 20, 2025 at 2:25 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Sat, 17 May 2025 at 02:20, Rob Clark <robdclark@gmail.com> wrote:
> > [...]
> > As for #1, even if I could get it to work, it would still be a lot more
> > mmu map/unmap (like on every pageflip, vs the current state that the vma
> > is kept around until the object is freed). [...]
>
> Before we get to #3, I'll need a bit more info here on why you have to
> map/unmap the VMA on every pageflip.

Previously we'd keep the VMA hanging around until the GEM obj is freed. But
that can't work if the VMA (via the VM_BO) is holding a reference to the GEM
obj.

I was kinda thinking about keeping the VMA around until the handle is
closed.. but that doesn't cover the dma-buf case (ie. when you re-import the
dma-buf fd each frame.. I know android does this, unsure about other wsi's).

> But actually I think 2 is the best option; I think in nouveau this is
> where we ended up: we didn't modify the old submission paths at all and
> kept the old bo/vm lifetimes.
>
> We just added completely new bind/exec ioctls, and you can only use one
> method once you've opened an fd.

hmm, but that means tracking VMAs against a single BO differently.. which..
at least seems ugly..

BR,
-R
On Wed, 21 May 2025 at 07:53, Rob Clark <robdclark@gmail.com> wrote:
> [...]
> Previously we'd keep the VMA hanging around until the GEM obj is freed.
> But that can't work if the VMA (via the VM_BO) is holding a reference to
> the GEM obj.
>
> I was kinda thinking about keeping the VMA around until the handle is
> closed.. but that doesn't cover the dma-buf case (ie. when you re-import
> the dma-buf fd each frame.. I know android does this, unsure about other
> wsi's).
> [...]
> hmm, but that means tracking VMAs against a single BO differently..
> which.. at least seems ugly..

I don't think it is if you already have the code to do that, and just add
gpuvm support in parallel.

You also have to factor in that the world is moving towards Vulkan for
everything, so any optimisations you've made for particular legacy paths
will need to be dealt with in the future picture anyways.

But I'd rather not hack gpuvm into being something it isn't. If there is a
meaningful commonality in legacy bo/vm bindings across drivers, we could
create something new, but the ref counting and handling is pretty
fundamental to the gpuvm architecture.

There should only be two paths, legacy and gpuvm, and you shouldn't ever be
mixing them on a particular exec path, since you should only have a vm per
userspace fd, and can pick which way to use it the first time someone calls
it.

Dave.
On Tue, May 20, 2025 at 3:31 PM Dave Airlie <airlied@gmail.com> wrote:
>
> On Wed, 21 May 2025 at 07:53, Rob Clark <robdclark@gmail.com> wrote:
> > [...]
> > hmm, but that means tracking VMAs against a single BO differently..
> > which.. at least seems ugly..
>
> I don't think it is if you already have the code to do that, and just add
> gpuvm support in parallel.
>
> You also have to factor in that the world is moving towards Vulkan for
> everything, so any optimisations you've made for particular legacy paths
> will need to be dealt with in the future picture anyways.

fwiw, the case I'm more worried about is the kms vm for scanout, that won't
be using vk

BR,
-R

> But I'd rather not hack gpuvm into being something it isn't. If there is a
> meaningful commonality in legacy bo/vm bindings across drivers, we could
> create something new, but the ref counting and handling is pretty
> fundamental to the gpuvm architecture.
>
> There should only be two paths, legacy and gpuvm, and you shouldn't ever
> be mixing them on a particular exec path, since you should only have a vm
> per userspace fd, and can pick which way to use it the first time someone
> calls it.
>
> Dave.
```diff
diff --git a/drivers/gpu/drm/drm_gpuvm.c b/drivers/gpu/drm/drm_gpuvm.c
index f9eb56f24bef..1e89a98caad4 100644
--- a/drivers/gpu/drm/drm_gpuvm.c
+++ b/drivers/gpu/drm/drm_gpuvm.c
@@ -1511,7 +1511,9 @@ drm_gpuvm_bo_destroy(struct kref *kref)
 	drm_gpuvm_bo_list_del(vm_bo, extobj, lock);
 	drm_gpuvm_bo_list_del(vm_bo, evict, lock);
 
-	drm_gem_gpuva_assert_lock_held(obj);
+	if (kref_read(&obj->refcount) > 0)
+		drm_gem_gpuva_assert_lock_held(obj);
+
 	list_del(&vm_bo->list.entry.gem);
 
 	if (ops && ops->vm_bo_free)
@@ -1871,7 +1873,8 @@ drm_gpuva_unlink(struct drm_gpuva *va)
 	if (unlikely(!obj))
 		return;
 
-	drm_gem_gpuva_assert_lock_held(obj);
+	if (kref_read(&obj->refcount) > 0)
+		drm_gem_gpuva_assert_lock_held(obj);
 	list_del_init(&va->gem.entry);
 
 	va->vm_bo = NULL;
```
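For reference, the scenario the relaxed assert is meant to tolerate is a final-free path in the style of the msm commit cited in the patch: by the time the driver's GEM free callback runs, the object's refcount is already zero, so holding (or asserting) the object lock is moot. A hedged sketch follows; `sketch_free_object()` and `sketch_unlink_all_vmas()` are illustrative stand-ins rather than real msm functions.

```c
/*
 * Illustrative free path: the refcount has already dropped to zero when
 * this callback runs, so with the patch above drm_gpuva_unlink() skips
 * drm_gem_gpuva_assert_lock_held() instead of tripping over a lock that
 * the free path (per commit a414fe3a2129) no longer takes.
 */
static void sketch_free_object(struct drm_gem_object *obj)
{
	/* kref_read(&obj->refcount) == 0 here */
	sketch_unlink_all_vmas(obj);	/* calls drm_gpuva_unlink() per VMA */

	drm_gem_object_release(obj);
	kfree(obj);
}
```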