Message ID | 1617333527-89782-1-git-send-email-jinsdb@126.com |
---|---|
State | New |
Series | drm/amdgpu: Fix a potential sdma invalid access |
Hi Qu,

On 02.04.21 at 05:18, Qu Huang wrote:
> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
> the bo->base.resv lock may be held by ttm_mem_evict_first(),

That can't happen: when bo_release_notify is called, the BO has no more references and is therefore deleted.

And we never evict a deleted BO; we just wait for it to become idle.

Regards,
Christian.

> and the VRAM mem will be evicted, mem region was replaced
> by Gtt mem region. amdgpu_bo_release_notify() will then
> hold the bo->base.resv lock, and SDMA will get an invalid
> address in amdgpu_fill_buffer(), resulting in a VMFAULT
> or memory corruption.
>
> To avoid it, we have to hold bo->base.resv lock first, and
> check whether the mem.mem_type is TTM_PL_VRAM.
>
> Signed-off-by: Qu Huang <jinsdb@126.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 4b29b82..8018574 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
>  	if (bo->base.resv == &bo->base._resv)
>  		amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
>
> -	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
> -	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
> +	if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
>  		return;
>
>  	dma_resv_lock(bo->base.resv, NULL);
>
> +	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
> +		dma_resv_unlock(bo->base.resv);
> +		return;
> +	}
> +
>  	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
>  	if (!WARN_ON(r)) {
>  		amdgpu_bo_fence(abo, fence, false);
> --
> 1.8.3.1
>
Hi Christian,

On 2021/4/3 0:25, Christian König wrote:
> Hi Qu,
>
> On 02.04.21 at 05:18, Qu Huang wrote:
>> Before dma_resv_lock(bo->base.resv, NULL) in amdgpu_bo_release_notify(),
>> the bo->base.resv lock may be held by ttm_mem_evict_first(),
>
> That can't happen since when bo_release_notify is called the BO has not
> more references and is therefore deleted.
>
> And we never evict a deleted BO, we just wait for it to become idle.
>

Yes, when the BO reference count drops to zero we enter ttm_bo_release(). But ttm_bo_release() first notifies the driver of the release (it calls amdgpu_bo_release_notify()), then tests whether the reservation object's fences have signaled, and only then marks the BO as deleted and removes it from the LRU list.

When ttm_bo_release() and ttm_mem_evict_first() run concurrently, the BO has not yet been removed from the LRU list and is not yet marked as deleted, so this can happen.

As a test, when we use a CPU memset instead of the SDMA fill in amdgpu_bo_release_notify(), the result is a page fault:

PID: 5490  TASK: ffff8e8136e04100  CPU: 4  COMMAND: "gemmPerf"
 #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784
 #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92
 #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80
 #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768
 #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6
 #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d
 #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae
 #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0
 #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925
 #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758
    [exception RIP: memset+31]
    RIP: ffffffffb2b8668f  RSP: ffff8e79eaa17ce8  RFLAGS: 00010a17
    RAX: bebebebebebebebe  RBX: ffff8e747bff10c0  RCX: 0000060b00200000
    RDX: 0000000000000000  RSI: 00000000000000be  RDI: ffffab807f000000
    RBP: ffff8e79eaa17d10   R8: ffff8e79eaa14000   R9: ffffab7c80000000
    R10: 000000000000bcba  R11: 00000000000001ba  R12: ffff8e79ebaa4050
    R13: ffffab7c80000000  R14: 0000000000022600  R15: ffff8e8136e04100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1 [amdgpu]
#11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm]
#12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm]
#13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm]
#14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115
#15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64
#16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7
#17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95
#18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf
#19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544
#20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb

Regards,
Qu.
Hi Qu,

On 03.04.21 at 07:08, Qu Huang wrote:
> Hi Christian,
>
> On 2021/4/3 0:25, Christian König wrote:
>> Hi Qu,
>>
>> On 02.04.21 at 05:18, Qu Huang wrote:
>>> Before dma_resv_lock(bo->base.resv, NULL) in
>>> amdgpu_bo_release_notify(),
>>> the bo->base.resv lock may be held by ttm_mem_evict_first(),
>>
>> That can't happen since when bo_release_notify is called the BO has not
>> more references and is therefore deleted.
>>
>> And we never evict a deleted BO, we just wait for it to become idle.
>>
> Yes, the bo reference counter return to zero will enter
> ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify())
> first happen, and then test if a reservation object's fences have been
> signaled, and then mark bo as deleted and remove bo from the LRU list.
>
> When ttm_bo_release() and ttm_mem_evict_first() is concurrent,
> the Bo has not been removed from the LRU list and is not marked as
> deleted, this will happen.

Not sure which code base you are on, but I don't see how this can happen.

ttm_mem_evict_first() calls ttm_bo_get_unless_zero(), and ttm_bo_release() is only called when the BO reference count becomes zero.

So ttm_mem_evict_first() will see that this BO is about to be destroyed and skip it.

>
> As a test, when we use CPU memset instead of SDMA fill in
> amdgpu_bo_release_notify(), the result is page fault:
>
> PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf"
> #0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784
> #1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92
> #2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80
> #3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768
> #4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6
> #5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d
> #6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae
> #7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0
> #8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925
> #9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758
> [exception RIP: memset+31]
> RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17
> RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000
> RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000
> RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000
> R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050
> R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> #10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1
> [amdgpu]
> #11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm]
> #12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm]
> #13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm]
> #14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115
> #15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64
> #16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7
> #17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95
> #18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf
> #19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544
> #20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb

Well, that might be perfectly expected. VRAM is not necessarily CPU accessible.

Regards,
Christian.
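The skip that Christian describes depends on a "get unless zero" reference grab: a path that finds an object on a list may only take a reference while the count is still non-zero, so an object whose last reference has already been dropped is never picked up again. The following is a minimal user-space sketch of that idea using C11 atomics. The names `toy_bo`, `toy_bo_get_unless_zero()` and `toy_bo_put()` are hypothetical stand-ins chosen for illustration; this is not the TTM implementation and says nothing about which kernel version either participant is running.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a buffer object; only the refcount matters here. */
struct toy_bo {
	atomic_int refcount;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns true on success, false if the last reference is already gone,
 * in which case the caller must leave the object alone.
 */
static bool toy_bo_get_unless_zero(struct toy_bo *bo)
{
	int old = atomic_load(&bo->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&bo->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true when the caller dropped the last one. */
static bool toy_bo_put(struct toy_bo *bo)
{
	return atomic_fetch_sub(&bo->refcount, 1) == 1;
}
```

In this model, once `toy_bo_put()` has returned true for an object, every later `toy_bo_get_unless_zero()` on it fails, which corresponds to the eviction path skipping a BO that is already on its way through the release path.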
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b82..8018574 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
 	if (bo->base.resv == &bo->base._resv)
 		amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
 
-	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
-	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
+	if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
 		return;
 
 	dma_resv_lock(bo->base.resv, NULL);
 
+	if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
+		dma_resv_unlock(bo->base.resv);
+		return;
+	}
+
 	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
 	if (!WARN_ON(r)) {
 		amdgpu_bo_fence(abo, fence, false);
Before amdgpu_bo_release_notify() takes dma_resv_lock(bo->base.resv, NULL), the bo->base.resv lock may be held by ttm_mem_evict_first(), which evicts the VRAM memory and replaces the BO's memory region with a GTT region. amdgpu_bo_release_notify() then acquires the bo->base.resv lock, and SDMA gets an invalid address in amdgpu_fill_buffer(), resulting in a VMFAULT or memory corruption.

To avoid this, take the bo->base.resv lock first and only then check whether mem.mem_type is TTM_PL_VRAM.

Signed-off-by: Qu Huang <jinsdb@126.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--
1.8.3.1
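The ordering the patch proposes (test the flag that never changes, take the lock, and only then test the placement) can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: a pthread mutex stands in for bo->base.resv, and `toy_buf`, `toy_evict()` and `toy_release_notify()` are hypothetical names. It shows only the check-under-lock pattern; whether the race it guards against can occur in the kernel is disputed in the replies above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

enum toy_placement { TOY_PL_SYSTEM, TOY_PL_GTT, TOY_PL_VRAM };

/* Hypothetical model of a buffer object. */
struct toy_buf {
	pthread_mutex_t resv;        /* stands in for bo->base.resv */
	enum toy_placement mem_type; /* stands in for bo->mem.mem_type */
	bool wipe_on_release;        /* stands in for the WIPE_ON_RELEASE flag */
	int wipes;                   /* counts how often the "fill" ran */
};

/* Eviction path: moves the buffer out of VRAM while holding the lock. */
static void toy_evict(struct toy_buf *b)
{
	pthread_mutex_lock(&b->resv);
	if (b->mem_type == TOY_PL_VRAM)
		b->mem_type = TOY_PL_GTT;
	pthread_mutex_unlock(&b->resv);
}

/*
 * Release path, ordered as in the patch: the flag is checked without the
 * lock, the placement is checked only after the lock is held, so it cannot
 * change between the check and the fill. Returns true if a wipe was done.
 */
static bool toy_release_notify(struct toy_buf *b)
{
	bool wiped = false;

	if (!b->wipe_on_release)
		return false;

	pthread_mutex_lock(&b->resv);
	if (b->mem_type == TOY_PL_VRAM) {
		b->wipes++; /* placeholder for the SDMA fill */
		wiped = true;
	}
	pthread_mutex_unlock(&b->resv);
	return wiped;
}
```

In this model a buffer that is still in VRAM is wiped once, while a buffer that `toy_evict()` has already moved to GTT is left untouched because the placement is re-read under the lock.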