Message ID | 20230525024438.507082-1-xiubli@redhat.com
---|---
State | New
Series | ceph: fix use-after-free bug for inodes when flushing capsnaps
On Thu, May 25, 2023 at 4:45 AM <xiubli@redhat.com> wrote:
>
> From: Xiubo Li <xiubli@redhat.com>
>
> There is a race between flushing the capsnaps and removing the inode
> from the 'mdsc->snap_flush_list' list:
>
>            Thread A                           Thread B
> ceph_queue_cap_snap()
>  -> allocate 'capsnapA'
>     ->ihold('&ci->vfs_inode')
>  -> add 'capsnapA' to 'ci->i_cap_snaps'
>  -> add 'ci' to 'mdsc->snap_flush_list'
> ...
>                                       ceph_flush_snaps()
>                                        -> __ceph_flush_snaps()
>                                           -> __send_flush_snap()
>                                       handle_cap_flushsnap_ack()
>                                        -> iput('&ci->vfs_inode')
>                                           this also will release 'ci'
> ...
> ceph_handle_snap()
>  -> flush_snaps()
>     -> iterate 'mdsc->snap_flush_list'
>     -> get the stale 'ci'
>     -> remove 'ci' from              ->ihold(&ci->vfs_inode) this
>        'mdsc->snap_flush_list'         will WARNING
>
> To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
> just before '__send_flush_snap()' to make sure the flushsnap 'ack'
> always arrives after the 'ci' has been removed from 'snap_flush_list'.

Hi Xiubo,

I'm not sure I'm following the logic here.  If the issue is that the
inode can be released by handle_cap_flushsnap_ack(), meaning that ci is
unsafe to dereference after the ack is received, what makes e.g. the
following snippet in __ceph_flush_snaps() work:

        ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
                                oldest_flush_tid);
        if (ret < 0) {
                pr_err("__flush_snaps: error sending cap flushsnap, "
                       "ino (%llx.%llx) tid %llu follows %llu\n",
                       ceph_vinop(inode), cf->tid, capsnap->follows);
        }

        ceph_put_cap_snap(capsnap);
        spin_lock(&ci->i_ceph_lock);

If the ack is handled after capsnap is put but before ci->i_ceph_lock
is reacquired, could use-after-free occur inside spin_lock()?

Thanks,

                Ilya

>
> Cc: stable@vger.kernel.org
> URL: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
> Signed-off-by: Xiubo Li <xiubli@redhat.com>
> ---
>  fs/ceph/caps.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> index feabf4cc0c4f..a8f890b3bb9a 100644
> --- a/fs/ceph/caps.c
> +++ b/fs/ceph/caps.c
> @@ -1595,6 +1595,11 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
>
>         dout("__flush_snaps %p session %p\n", inode, session);
>
> +       /* we will flush them all; remove this inode from the queue */
> +       spin_lock(&mdsc->snap_flush_lock);
> +       list_del_init(&ci->i_snap_flush_item);
> +       spin_unlock(&mdsc->snap_flush_lock);
> +
>         list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
>                 /*
>                  * we need to wait for sync writes to complete and for dirty
> @@ -1726,10 +1731,6 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
>                 *psession = session;
>         else
>                 ceph_put_mds_session(session);
> -       /* we flushed them all; remove this inode from the queue */
> -       spin_lock(&mdsc->snap_flush_lock);
> -       list_del_init(&ci->i_snap_flush_item);
> -       spin_unlock(&mdsc->snap_flush_lock);
>  }
>
>  /*
> --
> 2.40.1
>
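The window Ilya points at can be sketched outside the kernel. Below is a minimal user-space model, not kernel code: 'fake_inode', 'fake_iput', and the thread layout are hypothetical stand-ins for ceph_inode_info, iput(), and the flush/ack paths. Built with `gcc -pthread -fsanitize=address`, AddressSanitizer reports the final lock acquisition as a heap-use-after-free:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for ceph_inode_info. */
struct fake_inode {
	atomic_int i_count;
	pthread_mutex_t i_ceph_lock;
};

static void fake_iput(struct fake_inode *inode)
{
	if (atomic_fetch_sub(&inode->i_count, 1) == 1)
		free(inode);	/* last reference: 'inode' is gone */
}

/* Plays handle_cap_flushsnap_ack(): drops the reference the capsnap held. */
static void *ack_thread(void *arg)
{
	fake_iput(arg);
	return NULL;
}

int main(void)
{
	struct fake_inode *inode = calloc(1, sizeof(*inode));
	pthread_t ack;

	atomic_init(&inode->i_count, 1);	/* the capsnap's reference */
	pthread_mutex_init(&inode->i_ceph_lock, NULL);

	/* "__send_flush_snap()": once sent, the ack may arrive any time. */
	pthread_create(&ack, NULL, ack_thread, inode);

	sleep(1);	/* let the ack win the race, for determinism */

	/*
	 * "spin_lock(&ci->i_ceph_lock)" after ceph_put_cap_snap(): the
	 * flush path holds no inode reference of its own, so this lock
	 * word now lives in freed memory -- a use-after-free.
	 */
	pthread_mutex_lock(&inode->i_ceph_lock);
	pthread_mutex_unlock(&inode->i_ceph_lock);

	pthread_join(ack, NULL);
	return 0;
}
```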
On 5/31/23 19:09, Ilya Dryomov wrote:
> On Thu, May 25, 2023 at 4:45 AM <xiubli@redhat.com> wrote:
>> From: Xiubo Li <xiubli@redhat.com>
>>
>> There is a race between flushing the capsnaps and removing the inode
>> from the 'mdsc->snap_flush_list' list:
>>
>>            Thread A                           Thread B
>> ceph_queue_cap_snap()
>>  -> allocate 'capsnapA'
>>     ->ihold('&ci->vfs_inode')
>>  -> add 'capsnapA' to 'ci->i_cap_snaps'
>>  -> add 'ci' to 'mdsc->snap_flush_list'
>> ...
>>                                       ceph_flush_snaps()
>>                                        -> __ceph_flush_snaps()
>>                                           -> __send_flush_snap()
>>                                       handle_cap_flushsnap_ack()
>>                                        -> iput('&ci->vfs_inode')
>>                                           this also will release 'ci'
>> ...
>> ceph_handle_snap()
>>  -> flush_snaps()
>>     -> iterate 'mdsc->snap_flush_list'
>>     -> get the stale 'ci'
>>     -> remove 'ci' from              ->ihold(&ci->vfs_inode) this
>>        'mdsc->snap_flush_list'         will WARNING
>>
>> To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
>> just before '__send_flush_snap()' to make sure the flushsnap 'ack'
>> always arrives after the 'ci' has been removed from 'snap_flush_list'.
> Hi Xiubo,
>
> I'm not sure I'm following the logic here.  If the issue is that the
> inode can be released by handle_cap_flushsnap_ack(), meaning that ci is
> unsafe to dereference after the ack is received, what makes e.g. the
> following snippet in __ceph_flush_snaps() work:
>
>         ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
>                                 oldest_flush_tid);
>         if (ret < 0) {
>                 pr_err("__flush_snaps: error sending cap flushsnap, "
>                        "ino (%llx.%llx) tid %llu follows %llu\n",
>                        ceph_vinop(inode), cf->tid, capsnap->follows);
>         }
>
>         ceph_put_cap_snap(capsnap);
>         spin_lock(&ci->i_ceph_lock);
>
> If the ack is handled after capsnap is put but before ci->i_ceph_lock
> is reacquired, could use-after-free occur inside spin_lock()?

Yeah, certainly this could happen.

After the 'ci' is freed, its memory may still be cached in the
'ceph_inode_cachep' slab cache or reused by others, so dereferencing
the stale 'ci' mostly won't crash. But just before 'ci' is freed,
'ci->vfs_inode.i_count' will already be 0, which is why the
'WARN_ON(atomic_inc_return(&inode->i_count) < 2)' call trace is much
easier to hit in the use-after-free case.

Thanks

- Xiubo

> Thanks,
>
> Ilya
>
>> Cc: stable@vger.kernel.org
>> URL: https://bugzilla.redhat.com/show_bug.cgi?id=2209299
>> Signed-off-by: Xiubo Li <xiubli@redhat.com>
>> ---
>>  fs/ceph/caps.c | 9 +++++----
>>  1 file changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
>> index feabf4cc0c4f..a8f890b3bb9a 100644
>> --- a/fs/ceph/caps.c
>> +++ b/fs/ceph/caps.c
>> @@ -1595,6 +1595,11 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
>>
>>         dout("__flush_snaps %p session %p\n", inode, session);
>>
>> +       /* we will flush them all; remove this inode from the queue */
>> +       spin_lock(&mdsc->snap_flush_lock);
>> +       list_del_init(&ci->i_snap_flush_item);
>> +       spin_unlock(&mdsc->snap_flush_lock);
>> +
>>         list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
>>                 /*
>>                  * we need to wait for sync writes to complete and for dirty
>> @@ -1726,10 +1731,6 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
>>                 *psession = session;
>>         else
>>                 ceph_put_mds_session(session);
>> -       /* we flushed them all; remove this inode from the queue */
>> -       spin_lock(&mdsc->snap_flush_lock);
>> -       list_del_init(&ci->i_snap_flush_item);
>> -       spin_unlock(&mdsc->snap_flush_lock);
>>  }
>>
>>  /*
>> --
>> 2.40.1
>>
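The check Xiubo refers to is the reference-count sanity test in the VFS ihold() helper: taking a new reference is only legal while you already hold one, so the count must be at least 2 after the increment. A toy user-space model of that behaviour (the 'WARN_ON' macro and 'fake_' names below are stand-ins, not the kernel implementations) shows why a stale 'ci' with 'i_count == 0' produces a warning rather than a crash while the slab memory is still mapped:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Toy WARN_ON: print instead of emitting a kernel call trace. */
#define WARN_ON(cond) \
	((cond) ? (fprintf(stderr, "WARNING: %s\n", #cond), 1) : 0)

struct fake_inode {
	atomic_int i_count;
};

/* Models ihold()'s WARN_ON(atomic_inc_return(&inode->i_count) < 2):
 * atomic_fetch_add() returns the old value, so old + 1 is the
 * post-increment count. */
static void fake_ihold(struct fake_inode *inode)
{
	WARN_ON(atomic_fetch_add(&inode->i_count, 1) + 1 < 2);
}

int main(void)
{
	struct fake_inode live = { .i_count = 1 };  /* held by someone */
	struct fake_inode stale = { .i_count = 0 }; /* last ref already put */

	fake_ihold(&live);	/* fine: 1 -> 2, no warning */
	fake_ihold(&stale);	/* stale 'ci': 0 -> 1, WARN_ON fires */
	return 0;
}
```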
On Wed, May 31, 2023 at 1:33 PM Xiubo Li <xiubli@redhat.com> wrote:
>
>
> On 5/31/23 19:09, Ilya Dryomov wrote:
> > On Thu, May 25, 2023 at 4:45 AM <xiubli@redhat.com> wrote:
> >> From: Xiubo Li <xiubli@redhat.com>
> >>
> >> There is a race between flushing the capsnaps and removing the inode
> >> from the 'mdsc->snap_flush_list' list:
> >>
> >>            Thread A                           Thread B
> >> ceph_queue_cap_snap()
> >>  -> allocate 'capsnapA'
> >>     ->ihold('&ci->vfs_inode')
> >>  -> add 'capsnapA' to 'ci->i_cap_snaps'
> >>  -> add 'ci' to 'mdsc->snap_flush_list'
> >> ...
> >>                                       ceph_flush_snaps()
> >>                                        -> __ceph_flush_snaps()
> >>                                           -> __send_flush_snap()
> >>                                       handle_cap_flushsnap_ack()
> >>                                        -> iput('&ci->vfs_inode')
> >>                                           this also will release 'ci'
> >> ...
> >> ceph_handle_snap()
> >>  -> flush_snaps()
> >>     -> iterate 'mdsc->snap_flush_list'
> >>     -> get the stale 'ci'
> >>     -> remove 'ci' from              ->ihold(&ci->vfs_inode) this
> >>        'mdsc->snap_flush_list'         will WARNING
> >>
> >> To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
> >> just before '__send_flush_snap()' to make sure the flushsnap 'ack'
> >> always arrives after the 'ci' has been removed from 'snap_flush_list'.
> > Hi Xiubo,
> >
> > I'm not sure I'm following the logic here.  If the issue is that the
> > inode can be released by handle_cap_flushsnap_ack(), meaning that ci is
> > unsafe to dereference after the ack is received, what makes e.g. the
> > following snippet in __ceph_flush_snaps() work:
> >
> >         ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
> >                                 oldest_flush_tid);
> >         if (ret < 0) {
> >                 pr_err("__flush_snaps: error sending cap flushsnap, "
> >                        "ino (%llx.%llx) tid %llu follows %llu\n",
> >                        ceph_vinop(inode), cf->tid, capsnap->follows);
> >         }
> >
> >         ceph_put_cap_snap(capsnap);
> >         spin_lock(&ci->i_ceph_lock);
> >
> > If the ack is handled after capsnap is put but before ci->i_ceph_lock
> > is reacquired, could use-after-free occur inside spin_lock()?
>
> Yeah, certainly this could happen.
>
> After the 'ci' is freed, its memory may still be cached in the
> 'ceph_inode_cachep' slab cache or reused by others, so dereferencing
> the stale 'ci' mostly won't crash. But just before 'ci' is freed

Dereferencing an invalid pointer is a major bug regardless of whether
it leads to a crash.  Crashing fast is actually a pretty good case as
far as potential outcomes go.

This needs to be addressed: just removing ci from mdsc->snap_flush_list
earlier obviously doesn't cut it.

Thanks,

                Ilya
On 5/31/23 20:01, Ilya Dryomov wrote:
> On Wed, May 31, 2023 at 1:33 PM Xiubo Li <xiubli@redhat.com> wrote:
>>
>> On 5/31/23 19:09, Ilya Dryomov wrote:
>>> On Thu, May 25, 2023 at 4:45 AM <xiubli@redhat.com> wrote:
>>>> From: Xiubo Li <xiubli@redhat.com>
>>>>
>>>> There is a race between flushing the capsnaps and removing the inode
>>>> from the 'mdsc->snap_flush_list' list:
>>>>
>>>>            Thread A                           Thread B
>>>> ceph_queue_cap_snap()
>>>>  -> allocate 'capsnapA'
>>>>     ->ihold('&ci->vfs_inode')
>>>>  -> add 'capsnapA' to 'ci->i_cap_snaps'
>>>>  -> add 'ci' to 'mdsc->snap_flush_list'
>>>> ...
>>>>                                       ceph_flush_snaps()
>>>>                                        -> __ceph_flush_snaps()
>>>>                                           -> __send_flush_snap()
>>>>                                       handle_cap_flushsnap_ack()
>>>>                                        -> iput('&ci->vfs_inode')
>>>>                                           this also will release 'ci'
>>>> ...
>>>> ceph_handle_snap()
>>>>  -> flush_snaps()
>>>>     -> iterate 'mdsc->snap_flush_list'
>>>>     -> get the stale 'ci'
>>>>     -> remove 'ci' from              ->ihold(&ci->vfs_inode) this
>>>>        'mdsc->snap_flush_list'         will WARNING
>>>>
>>>> To fix this, remove the 'ci' from the 'mdsc->snap_flush_list' list
>>>> just before '__send_flush_snap()' to make sure the flushsnap 'ack'
>>>> always arrives after the 'ci' has been removed from 'snap_flush_list'.
>>> Hi Xiubo,
>>>
>>> I'm not sure I'm following the logic here.  If the issue is that the
>>> inode can be released by handle_cap_flushsnap_ack(), meaning that ci is
>>> unsafe to dereference after the ack is received, what makes e.g. the
>>> following snippet in __ceph_flush_snaps() work:
>>>
>>>         ret = __send_flush_snap(inode, session, capsnap, cap->mseq,
>>>                                 oldest_flush_tid);
>>>         if (ret < 0) {
>>>                 pr_err("__flush_snaps: error sending cap flushsnap, "
>>>                        "ino (%llx.%llx) tid %llu follows %llu\n",
>>>                        ceph_vinop(inode), cf->tid, capsnap->follows);
>>>         }
>>>
>>>         ceph_put_cap_snap(capsnap);
>>>         spin_lock(&ci->i_ceph_lock);
>>>
>>> If the ack is handled after capsnap is put but before ci->i_ceph_lock
>>> is reacquired, could use-after-free occur inside spin_lock()?
>> Yeah, certainly this could happen.
>>
>> After the 'ci' is freed, its memory may still be cached in the
>> 'ceph_inode_cachep' slab cache or reused by others, so dereferencing
>> the stale 'ci' mostly won't crash. But just before 'ci' is freed
> Dereferencing an invalid pointer is a major bug regardless of whether
> it leads to a crash.  Crashing fast is actually a pretty good case as
> far as potential outcomes go.
>
> This needs to be addressed: just removing ci from mdsc->snap_flush_list
> earlier obviously doesn't cut it.

This could make sure that 'ci' is removed from the
'mdsc->snap_flush_list' list before the ack arrives.

There is another way to fix this, though: take an extra 'i_count'
reference when adding 'ci' to the 'mdsc->snap_flush_list' list, as
sketched below. This one seems safer?

Thanks

- Xiubo

> Thanks,
>
> Ilya
>
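The alternative amounts to making list membership itself pin the inode: the flushsnap ack's iput() can then never drop the last reference while 'ci' is still reachable via 'mdsc->snap_flush_list'. A kernel-style sketch of the idea (an untested illustration, not the eventual patch; the exact call sites and the list_empty() guard are assumptions):

```c
/* When queueing, e.g. in ceph_queue_cap_snap(): the list takes its
 * own inode reference. */
spin_lock(&mdsc->snap_flush_lock);
if (list_empty(&ci->i_snap_flush_item))
	ihold(&ci->vfs_inode);		/* pin 'ci' while it is on the list */
list_add_tail(&ci->i_snap_flush_item, &mdsc->snap_flush_list);
spin_unlock(&mdsc->snap_flush_lock);

/* When dequeueing, e.g. in flush_snaps(): drop the list's reference
 * only after 'ci' has been unlinked. */
spin_lock(&mdsc->snap_flush_lock);
list_del_init(&ci->i_snap_flush_item);
spin_unlock(&mdsc->snap_flush_lock);
iput(&ci->vfs_inode);
```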
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index feabf4cc0c4f..a8f890b3bb9a 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1595,6 +1595,11 @@ static void __ceph_flush_snaps(struct ceph_inode_info *ci,
 
 	dout("__flush_snaps %p session %p\n", inode, session);
 
+	/* we will flush them all; remove this inode from the queue */
+	spin_lock(&mdsc->snap_flush_lock);
+	list_del_init(&ci->i_snap_flush_item);
+	spin_unlock(&mdsc->snap_flush_lock);
+
 	list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
 		/*
 		 * we need to wait for sync writes to complete and for dirty
@@ -1726,10 +1731,6 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
 		*psession = session;
 	else
 		ceph_put_mds_session(session);
-	/* we flushed them all; remove this inode from the queue */
-	spin_lock(&mdsc->snap_flush_lock);
-	list_del_init(&ci->i_snap_flush_item);
-	spin_unlock(&mdsc->snap_flush_lock);
 }
 
 /*