Message ID | 20220505092132.1901800-3-alan.previn.teres.alexis@intel.com |
---|---|
State | New |
Headers | show |
Series | [1/3] drm/i915: Revert "drm/i915/gem: Asynchronous cmdparser" | expand |
Acked-by: Jon Bloomfield <jon.bloomfield@intel.com> > -----Original Message----- > From: Teres Alexis, Alan Previn <alan.previn.teres.alexis@intel.com> > Sent: Thursday, May 5, 2022 2:22 AM > To: gfx-internal-devel@eclists.intel.com > Cc: Teres Alexis, Alan Previn <alan.previn.teres.alexis@intel.com>; Harrison, > John C <john.c.harrison@intel.com>; Jason Ekstrand > <jason@jlekstrand.net>; Slusarz, Marcin <marcin.slusarz@intel.com>; > stable@vger.kernel.org; Jason Ekstrand <jason.ekstrand@intel.com>; Daniel > Vetter <daniel.vetter@ffwll.ch>; Bloomfield, Jon > <jon.bloomfield@intel.com> > Subject: [PATCH 2/3] Revert "drm/i915: Propagate errors on awaiting already > signaled fences" > > From: Jason Ekstrand <jason@jlekstrand.net> > > This reverts commit 9e31c1fe45d555a948ff66f1f0e3fe1f83ca63f7. Ever > since that commit, we've been having issues where a hang in one client > can propagate to another. In particular, a hang in an app can propagate > to the X server which causes the whole desktop to lock up. > > Error propagation along fences sound like a good idea, but as your bug > shows, surprising consequences, since propagating errors across security > boundaries is not a good thing. > > What we do have is track the hangs on the ctx, and report information to > userspace using RESET_STATS. That's how arb_robustness works. Also, if my > understanding is still correct, the EIO from execbuf is when your context > is banned (because not recoverable or too many hangs). And in all these > cases it's up to userspace to figure out what is all impacted and should > be reported to the application, that's not on the kernel to guess and > automatically propagate. > > What's more, we're also building more features on top of ctx error > reporting with RESET_STATS ioctl: Encrypted buffers use the same, and the > userspace fence wait also relies on that mechanism. So it is the path > going forward for reporting gpu hangs and resets to userspace. > > So all together that's why I think we should just bury this idea again as > not quite the direction we want to go to, hence why I think the revert is > the right option here. > > For backporters: Please note that you _must_ have a backport of > https://lore.kernel.org/dri-devel/20210602164149.391653-2- > jason@jlekstrand.net/ > for otherwise backporting just this patch opens up a security bug. > > v2: Augment commit message. Also restore Jason's sob that I > accidentally lost. > > v3: Add a note for backporters > > Signed-off-by: Jason Ekstrand <jason@jlekstrand.net> > Reported-by: Marcin Slusarz <marcin.slusarz@intel.com> > Cc: <stable@vger.kernel.org> # v5.6+ > Cc: Jason Ekstrand <jason.ekstrand@intel.com> > Cc: Marcin Slusarz <marcin.slusarz@intel.com> > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3080 > Fixes: 9e31c1fe45d5 ("drm/i915: Propagate errors on awaiting already > signaled fences") > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> > Reviewed-by: Jon Bloomfield <jon.bloomfield@intel.com> > Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> > Link: > https://patchwork.freedesktop.org/patch/msgid/20210714193419.1459723- > 3-jason@jlekstrand.net > (cherry picked from commit 93a2711cddd5760e2f0f901817d71c93183c3b87) > --- > drivers/gpu/drm/i915/i915_request.c | 8 ++------ > 1 file changed, 2 insertions(+), 6 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_request.c > b/drivers/gpu/drm/i915/i915_request.c > index 8bd484e2a0ec..08484c14d11e 100644 > --- a/drivers/gpu/drm/i915/i915_request.c > +++ b/drivers/gpu/drm/i915/i915_request.c > @@ -1426,10 +1426,8 @@ i915_request_await_execution(struct > i915_request *rq, > > do { > fence = *child++; > - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence- > >flags)) { > - i915_sw_fence_set_error_once(&rq->submit, fence- > >error); > + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence- > >flags)) > continue; > - } > > if (fence->context == rq->fence.context) > continue; > @@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct > i915_request *rq, struct dma_fence *fence) > > do { > fence = *child++; > - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence- > >flags)) { > - i915_sw_fence_set_error_once(&rq->submit, fence- > >error); > + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence- > >flags)) > continue; > - } > > /* > * Requests on the same timeline are explicitly ordered, > along
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c index 8bd484e2a0ec..08484c14d11e 100644 --- a/drivers/gpu/drm/i915/i915_request.c +++ b/drivers/gpu/drm/i915/i915_request.c @@ -1426,10 +1426,8 @@ i915_request_await_execution(struct i915_request *rq, do { fence = *child++; - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { - i915_sw_fence_set_error_once(&rq->submit, fence->error); + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) continue; - } if (fence->context == rq->fence.context) continue; @@ -1527,10 +1525,8 @@ i915_request_await_dma_fence(struct i915_request *rq, struct dma_fence *fence) do { fence = *child++; - if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) { - i915_sw_fence_set_error_once(&rq->submit, fence->error); + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) continue; - } /* * Requests on the same timeline are explicitly ordered, along