Message ID | 20240503211129.679762-2-torvalds@linux-foundation.org |
---|---|
State | New |
Headers | show |
Series | epoll: try to be a _bit_ better about file lifetimes | expand |
On Fri, May 03, 2024 at 02:11:30PM -0700, Linus Torvalds wrote: > epoll is a mess, and does various invalid things in the name of > performance. > > Let's try to rein it in a bit. Something like this, perhaps? > +/* > + * The ffd.file pointer may be in the process of > + * being torn down due to being closed, but we > + * may not have finished eventpoll_release() yet. > + * > + * Technically, even with the atomic_long_inc_not_zero, > + * the file may have been free'd and then gotten > + * re-allocated to something else (since files are > + * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU). Can we get to ep_item_poll(epi, ...) after eventpoll_release_file() got past __ep_remove()? Because if we can, we have a worse problem - epi freed under us. If not, we couldn't possibly have reached ->release() yet, let alone freeing anything.
On Fri, 3 May 2024 at 14:24, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Can we get to ep_item_poll(epi, ...) after eventpoll_release_file() > got past __ep_remove()? Because if we can, we have a worse problem - > epi freed under us. Look at the hack in __ep_remove(): if it is concurrent with eventpoll_release_file(), it will hit this code spin_lock(&file->f_lock); if (epi->dying && !force) { spin_unlock(&file->f_lock); return false; } and not free the epi. But as far as I can tell, almost nothing else cares about the f_lock and dying logic. And in fact, I don't think doing spin_lock(&file->f_lock); is even valid in the places that look up file through "epi->ffd.file", because the lock itself is inside the thing that you can't trust until you've taken the lock... So I agree with Kees about the use of "atomic_dec_not_zero()" kind of logic - but it also needs to be in an RCU-readlocked region, I think. I wish epoll() just took the damn file ref itself. But since it relies on the file refcount to release the data structure, that obviously can't work. Linus
On Fri, May 03, 2024 at 02:33:37PM -0700, Linus Torvalds wrote: > Look at the hack in __ep_remove(): if it is concurrent with > eventpoll_release_file(), it will hit this code > > spin_lock(&file->f_lock); > if (epi->dying && !force) { > spin_unlock(&file->f_lock); > return false; > } > > and not free the epi. What does that have to do with ep_item_poll()? eventpoll_release_file() itself calls __ep_remove(). Have that happen while ep_item_poll() is running in another thread and you've got a problem. AFAICS, exclusion is on ep->mtx. Callers of ep_item_poll() are * __ep_eventpoll_poll() - grabs ->mtx * ep_insert() - called under ->mtx * ep_modify() - calls are under ->mtx * ep_send_events() - grabs ->mtx and eventpoll_release_file() grabs ->mtx around __ep_remove(). How do you get through eventpoll_release_file() while someone has entered ep_item_poll()?
On Fri, 3 May 2024 at 14:45, Al Viro <viro@zeniv.linux.org.uk> wrote: > > How do you get through eventpoll_release_file() while someone > has entered ep_item_poll()? Doesn't matter. Look, it's enough that the file count has gone down to zero. You may not even have gotten to eventpoll_release_file() yet - the important part is that you're on your *way* to it. That means that the file will be released - and it means that you have violated all the refcounting rules for poll(). So I think you're barking up the wrong tree. Linus
On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote: > On Fri, 3 May 2024 at 14:45, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > How do you get through eventpoll_release_file() while someone > > has entered ep_item_poll()? > > Doesn't matter. > > Look, it's enough that the file count has gone down to zero. You may > not even have gotten to eventpoll_release_file() yet - the important > part is that you're on your *way* to it. > > That means that the file will be released - and it means that you have > violated all the refcounting rules for poll(). > > So I think you're barking up the wrong tree. IMO there are several things in that mess (aside of epoll being what it is). Trying to grab refcount as you do is fine; the comment is seriously misleading, though - we *are* guaranteed that struct file hasn't hit ->release(), let alone getting freed and reused. Having pollwait callback grab references is fine - and the callback belongs to whoever's calling ->poll(). Having ->poll() instance itself grab reference is really asking for problem, even on the boxen that have CONFIG_EPOLL turned off. select(2) is enough; it will take care of references grabbed by __pollwait(), but it doesn't know anything about the references driver has stashed hell knows where for hell knows how long.
On Fri, May 03, 2024 at 11:01:45PM +0100, Al Viro wrote: > Having ->poll() instance itself grab reference is really asking for problem, > even on the boxen that have CONFIG_EPOLL turned off. select(2) is enough; > it will take care of references grabbed by __pollwait(), but it doesn't > know anything about the references driver has stashed hell knows where for > hell knows how long. Suppose your program calls select() on a pipe and dmabuf, sees data to be read from pipe, reads it, closes both pipe and dmabuf and exits. Would you expect that dmabuf file would stick around for hell knows how long after that? I would certainly be very surprised by running into that...
On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote: > That means that the file will be released - and it means that you have > violated all the refcounting rules for poll(). I feel like I've been looking at this too long. I think I see another problem here, but with dmabuf even when epoll is fixed: dma_buf_poll() get_file(dmabuf->file) /* f_count + 1 */ dma_buf_poll_add_cb() dma_resv_for_each_fence ... dma_fence_add_callback(fence, ..., dma_buf_poll_cb) dma_buf_poll_cb() ... fput(dmabuf->file); /* f_count - 1 ... for each fence */ Isn't it possible to call dma_buf_poll_cb() (and therefore fput()) multiple times if there is more than 1 fence? Perhaps I've missed a place where a single struct dma_resv will only ever signal 1 fence? But looking through dma_fence_signal_timestamp_locked(), I don't see anything about resv nor somehow looking into other fence cb_list contents...
On Fri, May 03, 2024 at 03:46:25PM -0700, Kees Cook wrote: > On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote: > > That means that the file will be released - and it means that you have > > violated all the refcounting rules for poll(). > > I feel like I've been looking at this too long. I think I see another > problem here, but with dmabuf even when epoll is fixed: > > dma_buf_poll() > get_file(dmabuf->file) /* f_count + 1 */ > dma_buf_poll_add_cb() > dma_resv_for_each_fence ... > dma_fence_add_callback(fence, ..., dma_buf_poll_cb) > > dma_buf_poll_cb() > ... > fput(dmabuf->file); /* f_count - 1 ... for each fence */ > > Isn't it possible to call dma_buf_poll_cb() (and therefore fput()) > multiple times if there is more than 1 fence? Perhaps I've missed a > place where a single struct dma_resv will only ever signal 1 fence? But > looking through dma_fence_signal_timestamp_locked(), I don't see > anything about resv nor somehow looking into other fence cb_list > contents... At a guess, r = dma_fence_add_callback(fence, &dcb->cb, dma_buf_poll_cb); if (!r) return true; prevents that - it returns 0 on success and -E... on error; insertion into the list happens only when it's returning 0, so...
On Fri, 3 May 2024 at 15:07, Al Viro <viro@zeniv.linux.org.uk> wrote: > > Suppose your program calls select() on a pipe and dmabuf, sees data to be read > from pipe, reads it, closes both pipe and dmabuf and exits. > > Would you expect that dmabuf file would stick around for hell knows how long > after that? I would certainly be very surprised by running into that... Why? That's the _point_ of refcounts. They make the thing they refcount stay around until it's no longer referenced. Now, I agree that dmabuf's are a bit odd in how they use a 'struct file' *as* their refcount, but hey, it's a specialty use. Unusual perhaps, but not exactly wrong. I suspect that if you saw a dmabuf just have its own 'refcount_t' and stay around until it was done, you wouldn't bat an eye at it, and it's really just the "it uses a struct file for counting" that you are reacting to. Linus
On Sat, May 04, 2024 at 12:03:18AM +0100, Al Viro wrote: > On Fri, May 03, 2024 at 03:46:25PM -0700, Kees Cook wrote: > > On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote: > > > That means that the file will be released - and it means that you have > > > violated all the refcounting rules for poll(). > > > > I feel like I've been looking at this too long. I think I see another > > problem here, but with dmabuf even when epoll is fixed: > > > > dma_buf_poll() > > get_file(dmabuf->file) /* f_count + 1 */ > > dma_buf_poll_add_cb() > > dma_resv_for_each_fence ... > > dma_fence_add_callback(fence, ..., dma_buf_poll_cb) > > > > dma_buf_poll_cb() > > ... > > fput(dmabuf->file); /* f_count - 1 ... for each fence */ > > > > Isn't it possible to call dma_buf_poll_cb() (and therefore fput()) > > multiple times if there is more than 1 fence? Perhaps I've missed a > > place where a single struct dma_resv will only ever signal 1 fence? But > > looking through dma_fence_signal_timestamp_locked(), I don't see > > anything about resv nor somehow looking into other fence cb_list > > contents... > > At a guess, > r = dma_fence_add_callback(fence, &dcb->cb, dma_buf_poll_cb); > if (!r) > return true; > > prevents that - it returns 0 on success and -E... on error; > insertion into the list happens only when it's returning 0, > so... Yes; thank you. I *have* been looking at it all too long. :) The last related thing is the drivers/gpu/drm/vmwgfx/ttm_object.c case: /** * get_dma_buf_unless_doomed - get a dma_buf reference if possible. * * @dmabuf: Non-refcounted pointer to a struct dma-buf. * * Obtain a file reference from a lookup structure that doesn't refcount * the file, but synchronizes with its release method to make sure it * has * not been freed yet. See for example kref_get_unless_zero * documentation. * Returns true if refcounting succeeds, false otherwise. * * Nobody really wants this as a public API yet, so let it mature here * for some time... */ static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf) { return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L; } If we end up adding epi_fget(), we'll have 2 cases of using "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed helper to live in file.h or something, with appropriate comments?
On Fri, May 03, 2024 at 04:16:15PM -0700, Linus Torvalds wrote: > On Fri, 3 May 2024 at 15:07, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Suppose your program calls select() on a pipe and dmabuf, sees data to be read > > from pipe, reads it, closes both pipe and dmabuf and exits. > > > > Would you expect that dmabuf file would stick around for hell knows how long > > after that? I would certainly be very surprised by running into that... > > Why? > > That's the _point_ of refcounts. They make the thing they refcount > stay around until it's no longer referenced. > > Now, I agree that dmabuf's are a bit odd in how they use a 'struct > file' *as* their refcount, but hey, it's a specialty use. Unusual > perhaps, but not exactly wrong. > > I suspect that if you saw a dmabuf just have its own 'refcount_t' and > stay around until it was done, you wouldn't bat an eye at it, and it's > really just the "it uses a struct file for counting" that you are > reacting to. *IF* those files are on purely internal filesystem, that's probably OK; do that with something on something mountable (char device, sysfs file, etc.) and you have a problem with filesystem staying busy. I'm really unfamiliar with the subsystem; it might be OK with all objects that use that for ->poll(), but that's definitely not a good thing to see in ->poll() instance in general. And code gets copied, so there really should be a big fat comment about the reasons why it's OK in this particular case. Said that, it seems that a better approach might be to have their ->release() cancel callbacks and drop fence references. Note that they *do* have refcounts - on fences. The file (well, dmabuf, really) is pinned only to protect against the situation when pending callback is still around. And Kees' observation about multiple fences is also interesting - we don't get extra fput(), but only because we get events only from one fence, which does look fishy...
On Fri, 3 May 2024 at 16:23, Kees Cook <keescook@chromium.org> wrote: > > static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf) > { > return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L; > } > > If we end up adding epi_fget(), we'll have 2 cases of using > "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed > helper to live in file.h or something, with appropriate comments? I wonder if we could try to abstract this out a bit more. These games with non-ref-counted file structures *feel* a bit like the games we play with non-ref-counted (aka "stashed") 'struct dentry' that got fairly recently cleaned up with path_from_stashed() when both nsfs and pidfs started doing the same thing. I'm not loving the TTM use of this thing, but at least the locking and logic feels a lot more straightforward (ie the atomic_long_inc_not_zero() here is clealy under the 'prime->mutex' lock IOW, the tty use looks correct to me, and it has fairly simple locking and is just catching the the race between 'fput()' decrementing the refcount and and 'file->f_op->release()' doing the actual release. You are right that it's similar to the epoll thing in that sense, it just looks a _lot_ more straightforward to me (and, unlike epoll, doesn't look actively buggy right now). Could we abstract out this kind of "stashed file pointer" so that we'd have a *common* form for this? Not just the inc_not_zero part, but the locking rule too? Linus
On Fri, 3 May 2024 at 16:39, Al Viro <viro@zeniv.linux.org.uk> wrote: > > *IF* those files are on purely internal filesystem, that's probably > OK; do that with something on something mountable (char device, > sysfs file, etc.) and you have a problem with filesystem staying > busy. Yeah, I agree, it's a bit annoying in general. That said, it's easy to do: stash a file descriptor in a unix domain socket, and that's basically exactly what you have: a random reference to a 'struct file' that will stay around for as long as you just keep that socket around, long after the "real" file descriptor has been closed, and entirely separately from it. And yes, that's exactly why unix domain socket transfers have caused so many problems over the years, with both refcount overflows and nasty garbage collection issues. So randomly taking references to file descriptors certainly isn't new. In fact, it's so common that I find the epoll pattern annoying, in that it does something special and *not* taking a ref - and it does that special thing to *other* ("innocent") file descriptors. Yes, dma-buf is a bit like those unix domain sockets in that it can keep random references alive for random times, but at least it does it just to its own file descriptors, not random other targets. So the dmabuf thing is very much a "I'm a special file that describes a dma buffer", and shouldn't really affect anything outside of active dmabuf uses (which admittedly is a large portion of the GPU drivers, and has been expanding from there...). I So the reason I'm annoyed at epoll in this case is that I think epoll triggered the bug in some entirely innocent subsystem. dma-buf is doing something differently odd, yes, but at least it's odd in a "I'm a specialized thing" sense, not in some "I screw over others" sense. Linus
On Fri, May 03, 2024 at 04:41:19PM -0700, Linus Torvalds wrote: > On Fri, 3 May 2024 at 16:23, Kees Cook <keescook@chromium.org> wrote: > > > > static bool __must_check get_dma_buf_unless_doomed(struct dma_buf *dmabuf) > > { > > return atomic_long_inc_not_zero(&dmabuf->file->f_count) != 0L; > > } > > > > If we end up adding epi_fget(), we'll have 2 cases of using > > "atomic_long_inc_not_zero" for f_count. Do we need some kind of blessed > > helper to live in file.h or something, with appropriate comments? > > I wonder if we could try to abstract this out a bit more. > > These games with non-ref-counted file structures *feel* a bit like the > games we play with non-ref-counted (aka "stashed") 'struct dentry' > that got fairly recently cleaned up with path_from_stashed() when both > nsfs and pidfs started doing the same thing. > > I'm not loving the TTM use of this thing, but at least the locking and > logic feels a lot more straightforward (ie the > atomic_long_inc_not_zero() here is clealy under the 'prime->mutex' > lock The TTM stuff is somewhat wild though and I've commented on that in https://lore.kernel.org/r/20240503-mitmachen-redakteur-2707ab0cacc3@brauner another thread that it can just use get_active_file(). Afaict, there's dma_buf_export() that allocates a new file and sets: file->private_data = dmabuf; dmabuf->file = file; dentry->d_fsdata = dmabuf; The file has f_op->release::dma_buf_file_release() as it's f_op->release method. When that's called the file's refcount is already zero but the file has not been freed yet. This will remove the dmabuf from some public list but it won't free it. dmabuf dentries have dma_buf_dentry_ops which use dentry->d_release::dma_buf_release() to release the actual dmabuf stashed in dentry->d_fsdata. So that ends up with: __fput() -> f_op->release::dma_buf_file_release() // handles file specific freeing -> dput() -> d_op->d_release::dma_buf_release() // handles dmabuf freeing // including the driver specific stuff. If you fput() the file then the dmabuf will be freed as well immediately after it when the dput() happens in __fput(). So that TTM thing does something else then in ttm_object_device_init(). It copies the dma_buf_ops into tdev->ops and replaces the dma_buf_ops release method with it's own ttm_prime_dmabuf_release() and stashes the old on in tdev->dma_buf_release. And it uses that to hook into the release path so that @dmabuf will still be valid for get_dma_buf_unless_doomed() under prime->mutex. But again, get_dma_buf_unless_doomed() can just be replaced with get_active_file() and then we're done with that part. > IOW, the tty use looks correct to me, and it has fairly simple locking > and is just catching the the race between 'fput()' decrementing the > refcount and and 'file->f_op->release()' doing the actual release. > > You are right that it's similar to the epoll thing in that sense, it > just looks a _lot_ more straightforward to me (and, unlike epoll, > doesn't look actively buggy right now). It's not buggy afaict. It literally can just switch to get_active_file() instead of open-coding it and we're done imho.
On Fri, May 03, 2024 at 02:33:37PM -0700, Linus Torvalds wrote: > On Fri, 3 May 2024 at 14:24, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > Can we get to ep_item_poll(epi, ...) after eventpoll_release_file() > > got past __ep_remove()? Because if we can, we have a worse problem - > > epi freed under us. > > Look at the hack in __ep_remove(): if it is concurrent with > eventpoll_release_file(), it will hit this code > > spin_lock(&file->f_lock); > if (epi->dying && !force) { > spin_unlock(&file->f_lock); > return false; > } > > and not free the epi. > > But as far as I can tell, almost nothing else cares about the f_lock > and dying logic. > > And in fact, I don't think doing > > spin_lock(&file->f_lock); > > is even valid in the places that look up file through "epi->ffd.file", > because the lock itself is inside the thing that you can't trust until > you've taken the lock... > > So I agree with Kees about the use of "atomic_dec_not_zero()" kind of > logic - but it also needs to be in an RCU-readlocked region, I think. Why isn't it enough to just force dma_buf_poll() to use get_file_active()? Then that whole problem goes away afaict. So the fix I had yesterday before I had to step away from the computer was literally just that [1]. It currently uses two atomic incs potentially but that can probably be fixed by the dma folks to be smarter about when they actually need to take a file reference. > > I wish epoll() just took the damn file ref itself. But since it relies > on the file refcount to release the data structure, that obviously > can't work. > > Linus diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 8fe5aa67b167..7149c45976e1 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -244,13 +244,18 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll) if (!dmabuf || !dmabuf->resv) return EPOLLERR; + if (!get_file_active(&dmabuf->file)) + return EPOLLERR; + resv = dmabuf->resv; poll_wait(file, &dmabuf->poll, poll); events = poll_requested_events(poll) & (EPOLLIN | EPOLLOUT); - if (!events) + if (!events) { + fput(file); return 0; + } dma_resv_lock(resv, NULL); @@ -268,7 +273,6 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll) if (events & EPOLLOUT) { /* Paired with fput in dma_buf_poll_cb */ get_file(dmabuf->file); - if (!dma_buf_poll_add_cb(resv, true, dcb)) /* No callback queued, wake up any other waiters */ dma_buf_poll_cb(NULL, &dcb->cb); @@ -301,6 +305,7 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll) } dma_resv_unlock(resv); + fput(file); return events; }
On Sat, May 04, 2024 at 12:39:00AM +0100, Al Viro wrote: > On Fri, May 03, 2024 at 04:16:15PM -0700, Linus Torvalds wrote: > > On Fri, 3 May 2024 at 15:07, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > > > Suppose your program calls select() on a pipe and dmabuf, sees data to be read > > > from pipe, reads it, closes both pipe and dmabuf and exits. > > > > > > Would you expect that dmabuf file would stick around for hell knows how long > > > after that? I would certainly be very surprised by running into that... > > > > Why? > > > > That's the _point_ of refcounts. They make the thing they refcount > > stay around until it's no longer referenced. > > > > Now, I agree that dmabuf's are a bit odd in how they use a 'struct > > file' *as* their refcount, but hey, it's a specialty use. Unusual > > perhaps, but not exactly wrong. > > > > I suspect that if you saw a dmabuf just have its own 'refcount_t' and > > stay around until it was done, you wouldn't bat an eye at it, and it's > > really just the "it uses a struct file for counting" that you are > > reacting to. > > *IF* those files are on purely internal filesystem, that's probably > OK; do that with something on something mountable (char device, > sysfs file, etc.) and you have a problem with filesystem staying > busy. In this instance it is ok because dma-buf is an internal fs. I had the exact same reaction you had initially but it doesn't matter for dma-buf afaict as that thing can never be unmounted.
On Sat, 4 May 2024 at 02:37, Christian Brauner <brauner@kernel.org> wrote: > > --- a/drivers/dma-buf/dma-buf.c > +++ b/drivers/dma-buf/dma-buf.c > @@ -244,13 +244,18 @@ static __poll_t dma_buf_poll(struct file *file, poll_table *poll) > if (!dmabuf || !dmabuf->resv) > return EPOLLERR; > > + if (!get_file_active(&dmabuf->file)) > + return EPOLLERR; [...] I *really* don't think anything that touches dma-buf.c can possibly be right. This is not a dma-buf.c bug. This is purely an epoll bug. Lookie here, the fundamental issue is that epoll can call '->poll()' on a file descriptor that is being closed concurrently. That means that *ANY* driver that relies on *any* data structure that is managed by the lifetime of the 'struct file' will have problems. Look, here's sock_poll(): static __poll_t sock_poll(struct file *file, poll_table *wait) { struct socket *sock = file->private_data; and that first line looks about as innocent as it possibly can, right? Now, imagine that this is called from 'epoll' concurrently with the file being closed for the last time (but it just hasn't _quite_ reached eventpoll_release() yet). Now, imagine that the kernel is built with preemption, and the epoll thread gets preempted _just_ before it loads 'file->private_data'. Furthermore, the machine is under heavy load, and it just stays off its CPU a long time. Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the file closing completes, eventpoll_release() finishes, and the preemption of the poll() thing just takes so long that you go through an RCU period too, so that the actual file has been released too. So now that totally innoced file->private_data load in the poll() is probably going to get random data. Yes, the file is allocated as SLAB_TYPESAFE_BY_RCU, so it's probably still a file. Not guaranteed, even the slab will get fully free'd in some situations. And yes, the above case is impossible to hit in practice. You have to hit quite the small race window with an operation that practically never happens in the first place. But my point is that the fact that the problem with file->f_count lifetimes happens for that dmabuf driver is not the fault of the dmabuf code. Not at all. It is *ENTIRELY* a bug in epoll, and the dmabuf code is probably just easier to hit because it has a poll() function that does things that have longer lifetimes than most things, and interacts more directly with that f_count. So I really don't understand why Al thinks this is "dmabuf does bad things with f_count". It damn well does not. dma-buf is the GOOD GUY here. It's doing things *PROPERLY*. It's taking refcounts like it damn well should. The fact that it takes ref-counts on something that the epoll code has messed up is *NOT* its fault. Linus
On Sat, 4 May 2024 at 08:32, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Now, during this TOTALLY INNOCENT sock_poll(), in another thread, the > file closing completes, eventpoll_release() finishes [..] Actually, Al is right that ep_item_poll() should be holding the ep->mtx, so eventpoll_release() -> eventpoll_release_file_file() -> mutex_lock(&ep->mtx) should block and the file doesn't actually get released. So I guess the sock_poll() issue cannot happen. It does need some poll() function that does 'fget()', and believes that it works. But because the f_count has already gone down to zero, fget() doesn't work, and doesn't keep the file around, and you have the bug. The cases that do fget() in poll() are probably race, but they aren't buggy. epoll is buggy. So my example wasn't going to work, but the argument isn't really any different, it's just a much more limited case that breaks. And maybe it's even *only* dma-buf that does that fget() in its ->poll() function. Even *then* it's not a dma-buf.c bug. Linus
On Sat, 4 May 2024 at 08:40, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And maybe it's even *only* dma-buf that does that fget() in its > ->poll() function. Even *then* it's not a dma-buf.c bug. They all do in the sense that they do poll_wait -> __pollwait -> get_file (*boom*) but the boom is very small because the poll_wait() will be undone by poll_freewait(), and normally poll/select has held the file count elevated. Except for epoll. Which leaves those pollwait entries around until it's done - but again will be held up on the ep->mtx before it does so. So everybody does some f_count games, but possibly dma-buf is the only one that ends up expecting to hold on to the f_count for longer periods. Linus
On Sat, 4 May 2024 at 08:32, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Lookie here, the fundamental issue is that epoll can call '->poll()' > on a file descriptor that is being closed concurrently. Thinking some more about this, and replying to myself... Actually, I wonder if we could *really* fix this by simply moving the eventpoll_release() to where it really belongs. If we did it in file_close_fd_locked(), it would actually make a *lot* more sense. Particularly since eventpoll actually uses this: struct epoll_filefd { struct file *file; int fd; } __packed; ie it doesn't just use the 'struct file *', it uses the 'fd' itself (for ep_find()). (Strictly speaking, it should also have a pointer to the 'struct files_struct' to make the 'int fd' be meaningful). IOW, eventpoll already considers the file _descriptor_ relevant, not just the file pointer, and that's destroyed at *close* time, not at 'fput()' time. Yeah, yeah, the locking situation in file_close_fd_locked() is a bit inconvenient, but if we can solve that, it would solve the problem in a fundamentally different way: remove the ep iterm before the file->f_count has actually been decremented, so the whole "race with fput()" would just go away entirely. I dunno. I think that would be the right thing to do, but I wouldn't be surprised if some disgusting eventpoll user then might depend on the current situation where the eventpoll thing stays around even after the close() if you have another copy of the file open. Linus
On Sun, 5 May 2024 at 03:50, Christian Brauner <brauner@kernel.org> wrote: > > And I agree with you that for some instances it's valid to take another > reference to a file from f_op->poll() but then they need to use > get_file_active() imho and simply handle the case where f_count is zero. I think this is (a) practically impossible to find (since most f_count updates are in various random helpers) (b) not tenable in the first place, since *EVERYBODY* does a f_count update as part of the bog-standard pollwait So (b) means that the notion of "warn if somebody increments f_count from zero" is broken to begin with - but it's doubly broken because it wouldn't find anything *anyway*, since this never happens in any normal situation. And (a) means that any non-automatic finding of this is practically impossible. > And we need to document that in Documentation/filesystems/file.rst or > locking.rst. WHY? Why cannot you and Al just admit that the problem is in epoll. Always has been, always will be. The fact is, it's not dma-buf that is violating any rules. It's epoll. It's calling out to random driver functions with a file pointer that is no longer valid. It really is that simple. I don't see why you are arguing for "unknown number of drivers - we know at least *one* - have to be fixed for a bug that is in epoll". If it was *easy* to fix, and if it was *easy* to validate, then sure. But that just isn't the case. In contrast, in epoll it's *trivial* to fix the one case where it does a VFS call-out, and just say "you have to follow the rules". So explain to me again why you want to mess up the driver interface and everybody who has a '.poll()' function, and not just fix the ONE clearly buggy piece of code. Because dammit,. epoll is clearly buggy. It's not enough to say "the file allocation isn't going away", and claim that that means that it's not buggy - when the file IS NO LONGER VALID! Linus
On Sun, May 05, 2024 at 09:46:05AM -0700, Linus Torvalds wrote: > WHY? > > Why cannot you and Al just admit that the problem is in epoll. Always > has been, always will be. Nobody (well, nobody who'd ever read epoll) argues that epoll is not a problem. > The fact is, it's not dma-buf that is violating any rules. Now, that is something I've a trouble with. Use of get_file() in there actually looks rather fishy, regardless of epoll. At the very least it needs a comment discouraging other instances from blindly copying this. A reference to struct file pins down more than driver-internal objects; if nothing else, it pins down a mount and if you don't have SB_NOUSER on file_inode(file)->i_sb->s_flags, it's really not a good idea. What's more, the reason for that get_file() is, AFAICS, that nodes we put into callback queue for fence(s) in question[*] are embedded into dmabuf and we don't want them gone before the callbacks have happened. Which looks fishy - it would make more sense to cancel these callbacks and drop the fence(s) in question from ->release(). I've no problem whatsoever with fs/eventpoll.c grabbing/dropping file reference around vfs_poll() calls. And I don't believe that "try to grab" has any place in dma_buf_poll(); it's just that I'm not happy about get_file() call being there in the first place. Sure, the call of ->poll() can bloody well lead to references being grabbed - by the pollwait callback, which the caller of ->poll() is aware of. It's ->poll() instance *itself* grabbing such references with vfs_poll() caller having no idea what's going on that has potential for being unpleasant. And we can't constify 'file' argument of ->poll() because of poll_wait(), so it's hard to catch those who do that kind of thing; I've explicitly said so upthread, I believe. But similar calls of get_file() in ->poll() instances (again, not the ones that are made by pollwait callback) are something to watch out for and having the caller pin struct file does not solve the problem. [*] at most one per direction, and I've no idea whether there can be more than one signalling fence for given dmabuf)
On Sun, May 05, 2024 at 01:03:07PM -0700, Linus Torvalds wrote: > On Sun, 5 May 2024 at 12:46, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > I've no problem with having epoll grab a reference, but if we make that > > a universal requirement ->poll() instances can rely upon, > > Al, we're note "making that a requirement". > > It always has been. Argh. We keep talking past each other. 0. special-cased ->f_count rule for ->poll() is a wart and it's better to get rid of it. 1. fs/eventpoll.c is a steaming pile of shit and I'd be glad to see git rm taken to it. Short of that, by all means, let's grab reference in there around the call of vfs_poll() (see (0)). 2. having ->poll() instances grab extra references to file passed to them is not something that should be encouraged; there's a plenty of potential problems, and "caller has it pinned, so we are fine with grabbing extra refs" is nowhere near enough to eliminate those. 3. dma-buf uses of get_file() are probably safe (epoll shite aside), but they do look fishy. That has nothing to do with epoll.
On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote: > > The fact is, it's not dma-buf that is violating any rules. It's epoll. > > I agree that epoll() not taking a reference on the file is at least > unexpected and contradicts the usual code patterns for the sake of > performance and that it very likely is the case that most callers of > f_op->poll() don't know this. > > Note, I cleary wrote upthread that I'm ok to do it like you suggested > but raised two concerns a) there's currently only one instance of > prolonged @file lifetime in f_op->poll() afaict and b) that there's > possibly going to be some performance impact on epoll(). > > So it's at least worth discussing what's more important because epoll() > is very widely used and it's not that we haven't favored performance > before. > > But you've already said that you aren't concerned with performance on > epoll() upthread. So afaict then there's really not a lot more to > discuss other than take the patch and see whether we get any complaints. Two closing thoughts: (1) I wonder if this won't cause userspace regressions for the semantics of epoll because dying files are now silently ignored whereas before they'd generated events. (2) The other part is that this seems to me that epoll() will now temporarly pin filesystems opening up the possibility for spurious EBUSY errors. If you register a file descriptor in an epoll instance and then close it and umount the filesystem but epoll managed to do an fget() on that fd before that close() call then epoll will pin that filesystem. If the f_op->poll() method does something that can take a while (blocks on a shared mutex of that subsystem) that umount is very likely going to return EBUSY suddenly. Afaict, before that this wouldn't have been an issue at all and is likely more serious than performance. (One option would be to only do epi_fget() for stuff like dma-buf that's never unmounted. That'll cover nearly every driver out there. Only "real" filesystems would have to contend with @file count going to zero but honestly they also deal with dentry lookup under RCU which is way more adventurous than this.) Maybe I'm barking up the wrong tree though.
On Sun, May 05, 2024 at 01:53:48PM -0700, Linus Torvalds wrote: > On Sun, 5 May 2024 at 13:30, Al Viro <viro@zeniv.linux.org.uk> wrote: > > > > 0. special-cased ->f_count rule for ->poll() is a wart and it's > > better to get rid of it. > > > > 1. fs/eventpoll.c is a steaming pile of shit and I'd be glad to see > > git rm taken to it. Short of that, by all means, let's grab reference > > in there around the call of vfs_poll() (see (0)). > > Agreed on 0/1. > > > 2. having ->poll() instances grab extra references to file passed > > to them is not something that should be encouraged; there's a plenty > > of potential problems, and "caller has it pinned, so we are fine with > > grabbing extra refs" is nowhere near enough to eliminate those. > > So it's not clear why you hate it so much, since those extra > references are totally normal in all the other VFS paths. > > I mean, they are perhaps not the *common* case, but we have a lot of > random get_file() calls sprinkled around in various places when you > end up passing a file descriptor off to some asynchronous operation > thing. > > Yeah, I think most of them tend to be special operations (eg the tty > TIOCCONS ioctl to redirect the console), but it's not like vfs_ioctl() > is *that* different from vfs_poll. Different operation, not somehow > "one is more special than the other". > > cachefiles and backing-file does it for regular IO, and drop it at IO > completion - not that different from what dma-buf does. It's in > ->read_iter() rather than ->poll(), but again: different operations, > but not "one of them is somehow fundamentally different". > > > 3. dma-buf uses of get_file() are probably safe (epoll shite aside), > > but they do look fishy. That has nothing to do with epoll. > > Now, what dma-buf basically seems to do is to avoid ref-counting its > own fundamental data structure, and replaces that by refcounting the > 'struct file' that *points* to it instead. > > And it is a bit odd, but it actually makes some amount of sense, > because then what it passes around is that file pointer (and it allows > passing it around from user space *as* that file). > > And honestly, if you look at why it then needs to add its refcount to > it all, it actually makes sense. dma-bufs have this notion of > "fences" that are basically completion points for the asynchronous > DMA. Doing a "poll()" operation will add a note to the fence to get > that wakeup when it's done. > > And yes, logically it takes a ref to the "struct dma_buf", but because > of how the lifetime of the dma_buf is associated with the lifetime of > the 'struct file', that then turns into taking a ref on the file. > > Unusual? Yes. But not illogical. Not obviously broken. Tying the > lifetime of the dma_buf to the lifetime of a file that is passed along > makes _sense_ for that use. > > I'm sure dma-bufs could add another level of refcounting on the > 'struct dma_buf' itself, and not make it be 1:1 with the file, but > it's not clear to me what the advantage would really be, or why it > would be wrong to re-use a refcount that is already there. So there is generally another refcount, because dma_buf is just the cross-driver interface to some kind of real underlying buffer object from the various graphics related subsystems we have. And since it's a pure file based api thing that ceases to serve any function once the fd/file is gone we tied all the dma_buf refcounting to the refcount struct file already maintains. But the underlying buffer object can easily outlive the dma_buf, and over the lifetime of an underlying buffer object you might actually end up creating different dma_buf api wrappers for it (but at least in drm we guarantee there's at most one, hence why vmwgfx does the atomic_inc_unless_zero trick, which I don't particularly like and isn't really needed). But we could add another refcount, it just means we have 3 of those then when only really 2 are needed. Also maybe here two: dma_fence are bounded like other disk i/o (including the option of timeouts if things go very wrong), so it's very much not forever but at most a few seconds worst case (shit hw/driver excluded, as usual). -Sima
On Mon, May 06, 2024 at 11:27:04AM +0200, Christian Brauner wrote: > On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote: > > > The fact is, it's not dma-buf that is violating any rules. It's epoll. > > > > I agree that epoll() not taking a reference on the file is at least > > unexpected and contradicts the usual code patterns for the sake of > > performance and that it very likely is the case that most callers of > > f_op->poll() don't know this. > > > > Note, I cleary wrote upthread that I'm ok to do it like you suggested > > but raised two concerns a) there's currently only one instance of > > prolonged @file lifetime in f_op->poll() afaict and b) that there's > > possibly going to be some performance impact on epoll(). > > > > So it's at least worth discussing what's more important because epoll() > > is very widely used and it's not that we haven't favored performance > > before. > > > > But you've already said that you aren't concerned with performance on > > epoll() upthread. So afaict then there's really not a lot more to > > discuss other than take the patch and see whether we get any complaints. > > Two closing thoughts: > > (1) I wonder if this won't cause userspace regressions for the semantics > of epoll because dying files are now silently ignored whereas before > they'd generated events. > > (2) The other part is that this seems to me that epoll() will now > temporarly pin filesystems opening up the possibility for spurious > EBUSY errors. > > If you register a file descriptor in an epoll instance and then > close it and umount the filesystem but epoll managed to do an fget() > on that fd before that close() call then epoll will pin that > filesystem. > > If the f_op->poll() method does something that can take a while > (blocks on a shared mutex of that subsystem) that umount is very > likely going to return EBUSY suddenly. > > Afaict, before that this wouldn't have been an issue at all and is > likely more serious than performance. > > (One option would be to only do epi_fget() for stuff like > dma-buf that's never unmounted. That'll cover nearly every > driver out there. Only "real" filesystems would have to contend with > @file count going to zero but honestly they also deal with dentry > lookup under RCU which is way more adventurous than this.) > > Maybe I'm barking up the wrong tree though. Sorry, had to step out for an appointment. Under the assumption that I'm not entirely off with this - and I really could be ofc - then one possibility would be that we enable persistence of files from f_op->poll() for SB_NOUSER filesystems. That'll catch everything that's relying on anonymous inodes (drm and all drivers) and init_pseudo() so everything that isn't actually unmountable (pipefs, pidfs, sockfs, etc.). So something like the _completely untested_ diff on top of your proposal above: diff --git a/fs/eventpoll.c b/fs/eventpoll.c index a3f0f868adc4..95968a462544 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1018,8 +1018,24 @@ static struct file *epi_fget(const struct epitem *epi) static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth) { - struct file *file = epi_fget(epi); + struct file *file = epi->ffd.file; __poll_t res; + bool unrefd = false; + + /* + * Taking a reference for anything that isn't mountable is fine + * because we don't have to worry about spurious EBUSY warnings + * from umount(). + * + * File count might go to zero in f_op->poll() for mountable + * filesystems. + */ + if (file->f_inode->i_sb->s_flags & SB_NOUSER) { + unrefd = true; + file = epi_fget(epi); + } else if (file_count(file) == 0) { + file = NULL; + } /* * We could return EPOLLERR | EPOLLHUP or something, @@ -1034,7 +1050,9 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, res = vfs_poll(file, pt); else res = __ep_eventpoll_poll(file, pt, depth); - fput(file); + + if (unrefd) + fput(file); return res & epi->event.events; } Basically, my worry is that we end up with really annoying to debug EBUSYs caused by epoll(). I'd really like to avoid that. But again, I might be wrong and this isn't an issue.
Am 04.05.24 um 20:20 schrieb Linus Torvalds: > On Sat, 4 May 2024 at 08:32, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> Lookie here, the fundamental issue is that epoll can call '->poll()' >> on a file descriptor that is being closed concurrently. > Thinking some more about this, and replying to myself... > > Actually, I wonder if we could *really* fix this by simply moving the > eventpoll_release() to where it really belongs. > > If we did it in file_close_fd_locked(), it would actually make a > *lot* more sense. Particularly since eventpoll actually uses this: > > struct epoll_filefd { > struct file *file; > int fd; > } __packed; > > ie it doesn't just use the 'struct file *', it uses the 'fd' itself > (for ep_find()). > > (Strictly speaking, it should also have a pointer to the 'struct > files_struct' to make the 'int fd' be meaningful). While I completely agree on this I unfortunately have to ruin the idea. Before we had KCMP some people relied on the strange behavior of eventpoll to compare struct files when the fd is the same. I just recently suggested that solution to somebody at AMD as a workaround when KCMP is disabled because of security hardening and I'm pretty sure I've seen it somewhere else as well. So when we change that it would break (undocumented?) UAPI behavior. Regards, Christian. > > IOW, eventpoll already considers the file _descriptor_ relevant, not > just the file pointer, and that's destroyed at *close* time, not at > 'fput()' time. > > Yeah, yeah, the locking situation in file_close_fd_locked() is a bit > inconvenient, but if we can solve that, it would solve the problem in > a fundamentally different way: remove the ep iterm before the > file->f_count has actually been decremented, so the whole "race with > fput()" would just go away entirely. > > I dunno. I think that would be the right thing to do, but I wouldn't > be surprised if some disgusting eventpoll user then might depend on > the current situation where the eventpoll thing stays around even > after the close() if you have another copy of the file open. > > Linus > _______________________________________________ > Linaro-mm-sig mailing list -- linaro-mm-sig@lists.linaro.org > To unsubscribe send an email to linaro-mm-sig-leave@lists.linaro.org
On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel@ffwll.ch> wrote: > > It's really annoying that on some distros/builds we don't have that, and > for gpu driver stack reasons we _really_ need to know whether a fd is the > same as another, due to some messy uniqueness requirements on buffer > objects various drivers have. It's sad that such a simple thing would require two other horrid models (EPOLL or KCMP). There'[s a reason that KCMP is a config option - *some* of that is horrible code - but the "compare file descriptors for equality" is not that reason. Note that KCMP really is a broken mess. It's also a potential security hole, even for the simple things, because of how it ends up comparing kernel pointers (ie it doesn't just say "same file descriptor", it gives an ordering of them, so you can use KCMP to sort things in kernel space). And yes, it orders them after obfuscating the pointer, but it's still not something I would consider sane as a baseline interface. It was designed for checkpoint-restore, it's the wrong thing to use for some "are these file descriptors the same". The same argument goes for using EPOLL for that. Disgusting hack. Just what are the requirements for the GPU stack? Is one of the file descriptors "trusted", IOW, you know what kind it is? Because dammit, it's *so* easy to do. You could just add a core DRM ioctl for it. Literally just struct fd f1 = fdget(fd1); struct fd f2 = fdget(fd2); int same; same = f1.file && f1.file == f2.file; fdput(fd1); fdput(fd2); return same; where the only question is if you also woudl want to deal with O_PATH fd's, in which case the "fdget()" would be "fdget_raw()". Honestly, adding some DRM ioctl for this sounds hacky, but it sounds less hacky than relying on EPOLL or KCMP. I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl too, if this is possibly a more common thing. and not just DRM wants it. Would something like that work for you? Linus
On Tue, May 07, 2024 at 09:46:31AM -0700, Linus Torvalds wrote: > On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel@ffwll.ch> wrote: > > > > It's really annoying that on some distros/builds we don't have that, and > > for gpu driver stack reasons we _really_ need to know whether a fd is the > > same as another, due to some messy uniqueness requirements on buffer > > objects various drivers have. > > It's sad that such a simple thing would require two other horrid > models (EPOLL or KCMP). > > There'[s a reason that KCMP is a config option - *some* of that is > horrible code - but the "compare file descriptors for equality" is not > that reason. > > Note that KCMP really is a broken mess. It's also a potential security > hole, even for the simple things, because of how it ends up comparing > kernel pointers (ie it doesn't just say "same file descriptor", it > gives an ordering of them, so you can use KCMP to sort things in > kernel space). > > And yes, it orders them after obfuscating the pointer, but it's still > not something I would consider sane as a baseline interface. It was > designed for checkpoint-restore, it's the wrong thing to use for some > "are these file descriptors the same". > > The same argument goes for using EPOLL for that. Disgusting hack. > > Just what are the requirements for the GPU stack? Is one of the file > descriptors "trusted", IOW, you know what kind it is? > > Because dammit, it's *so* easy to do. You could just add a core DRM > ioctl for it. Literally just > > struct fd f1 = fdget(fd1); > struct fd f2 = fdget(fd2); > int same; > > same = f1.file && f1.file == f2.file; > fdput(fd1); > fdput(fd2); > return same; > > where the only question is if you also woudl want to deal with O_PATH > fd's, in which case the "fdget()" would be "fdget_raw()". > > Honestly, adding some DRM ioctl for this sounds hacky, but it sounds > less hacky than relying on EPOLL or KCMP. Well, in slightly more code (because it's part of the "import this dma-buf/dma-fence/whatever fd into a driver object" ioctl) this is what we do. The issue is that there's generic userspace (like compositors) that sees these things fly by and would also like to know whether the other side they receive them from is doing nasty stuff/buggy/evil. And they don't have access to the device drm fd (since those are a handful of layers away behind the opengl/vulkan userspace drivers even if the compositor could get at them, and in some cases not even that). So if we do this in drm we'd essentially have to create a special drm_compare_files chardev, put the ioctl there and then tell everyone to make that thing world-accessible. Which is just too close to a real syscall that it's offensive, and hey kcmp does what we want already (but unfortunately also way more). So we rejected adding that to drm. But we did think about it. > I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl > too, if this is possibly a more common thing. and not just DRM wants > it. > > Would something like that work for you? Yes. Adding Simon and Pekka as two of the usual suspects for this kind of stuff. Also example code (the int return value is just so that callers know when kcmp isn't available, they all only care about equality): https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/util/os_file.c#L239 -Sima -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Am 08.05.24 um 10:23 schrieb Christian Brauner: > On Tue, May 07, 2024 at 07:45:02PM +0200, Christian König wrote: >> Am 07.05.24 um 18:46 schrieb Linus Torvalds: >>> On Tue, 7 May 2024 at 04:03, Daniel Vetter <daniel@ffwll.ch> wrote: >>>> It's really annoying that on some distros/builds we don't have that, and >>>> for gpu driver stack reasons we _really_ need to know whether a fd is the >>>> same as another, due to some messy uniqueness requirements on buffer >>>> objects various drivers have. >>> It's sad that such a simple thing would require two other horrid >>> models (EPOLL or KCMP). >>> >>> There'[s a reason that KCMP is a config option - *some* of that is >>> horrible code - but the "compare file descriptors for equality" is not >>> that reason. >>> >>> Note that KCMP really is a broken mess. It's also a potential security >>> hole, even for the simple things, because of how it ends up comparing >>> kernel pointers (ie it doesn't just say "same file descriptor", it >>> gives an ordering of them, so you can use KCMP to sort things in >>> kernel space). >>> >>> And yes, it orders them after obfuscating the pointer, but it's still >>> not something I would consider sane as a baseline interface. It was >>> designed for checkpoint-restore, it's the wrong thing to use for some >>> "are these file descriptors the same". >>> >>> The same argument goes for using EPOLL for that. Disgusting hack. >>> >>> Just what are the requirements for the GPU stack? Is one of the file >>> descriptors "trusted", IOW, you know what kind it is? >>> >>> Because dammit, it's *so* easy to do. You could just add a core DRM >>> ioctl for it. Literally just >>> >>> struct fd f1 = fdget(fd1); >>> struct fd f2 = fdget(fd2); >>> int same; >>> >>> same = f1.file && f1.file == f2.file; >>> fdput(fd1); >>> fdput(fd2); >>> return same; >>> >>> where the only question is if you also woudl want to deal with O_PATH >>> fd's, in which case the "fdget()" would be "fdget_raw()". >>> >>> Honestly, adding some DRM ioctl for this sounds hacky, but it sounds >>> less hacky than relying on EPOLL or KCMP. >>> >>> I'd be perfectly ok with adding a generic "FISAME" VFS level ioctl >>> too, if this is possibly a more common thing. and not just DRM wants >>> it. >>> >>> Would something like that work for you? >> Well the generic approach yes, the DRM specific one maybe. IIRC we need to >> be able to compare both DRM as well as DMA-buf file descriptors. >> >> The basic problem userspace tries to solve is that drivers might get the >> same fd through two different code paths. >> >> For example application using OpenGL/Vulkan for rendering and VA-API for >> video decoding/encoding at the same time. >> >> Both APIs get a fd which identifies the device to use. It can be the same, >> but it doesn't have to. >> >> If it's the same device driver connection (or in kernel speak underlying >> struct file) then you can optimize away importing and exporting of buffers >> for example. >> >> Additional to that it makes cgroup accounting much easier because you don't >> count things twice because they are shared etc... > One thing to keep in mind is that a generic VFS level comparing function > will only catch the obvious case where you have dup() equivalency as > outlined above by Linus. That's what most people are interested in and > that could easily replace most kcmp() use-cases for comparing fds. > > But, of course there's the case where you have two file descriptors > referring to two different files that reference the same underlying > object (usually stashed in file->private_data). > > For most cases that problem can ofc be solved by comparing the > underlying inode. But that doesn't work for drivers using the generic > anonymous inode infrastructure because it uses the same inode for > everything or for cases where the same underlying object can even be > represented by different inodes. > > So for such cases a driver specific ioctl() to compare two fds will > be needed in addition to the generic helper. At least for the DRM we already have some solution for that, see drmGetPrimaryDeviceNameFromFd() for an example. Basically the file->private_data is still something different, but we use this to figure out if we have two file descriptors (with individual struct files underneath) pointing to the same hw driver. This is important if you need to know if just importing/exporting of DMA-buf handles between the two file descriptors is enough or if you deal with two different hw devices and need to do stuff like format conversion etc... Regards, Christian.
On Mon, May 06, 2024 at 04:29:44PM +0200, Christian König wrote: > Am 04.05.24 um 20:20 schrieb Linus Torvalds: > > On Sat, 4 May 2024 at 08:32, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > Lookie here, the fundamental issue is that epoll can call '->poll()' > > > on a file descriptor that is being closed concurrently. > > Thinking some more about this, and replying to myself... > > > > Actually, I wonder if we could *really* fix this by simply moving the > > eventpoll_release() to where it really belongs. > > > > If we did it in file_close_fd_locked(), it would actually make a > > *lot* more sense. Particularly since eventpoll actually uses this: > > > > struct epoll_filefd { > > struct file *file; > > int fd; > > } __packed; > > > > ie it doesn't just use the 'struct file *', it uses the 'fd' itself > > (for ep_find()). > > > > (Strictly speaking, it should also have a pointer to the 'struct > > files_struct' to make the 'int fd' be meaningful). > > While I completely agree on this I unfortunately have to ruin the idea. > > Before we had KCMP some people relied on the strange behavior of eventpoll > to compare struct files when the fd is the same. > > I just recently suggested that solution to somebody at AMD as a workaround > when KCMP is disabled because of security hardening and I'm pretty sure I've > seen it somewhere else as well. > > So when we change that it would break (undocumented?) UAPI behavior. I've worked on that a bit yesterday and I learned new things about epoll and ran into some limitations. Like, what happens if process P1 has a file descriptor registered in an epoll instance and now P1 forks and creates P2. So every file that P1 maintains gets copied into a new file descriptor table for P2. And the same file descriptors refer to the same files for both P1 and P2. So there's two interesting cases here: (1) P2 explicitly removes the file descriptor from the epoll instance via epoll_ctl(EPOLL_CTL_DEL). That removal affects both P1 and P2 since the <fd, file> pair is only registered once and it isn't marked whether it belongs to P1 and P2 fdtable. So effectively fork()ing with epoll creates a weird shared state where removal of file descriptors that were registered before the fork() affects both child and parent. I found that surprising even though I've worked with epoll quite extensively in low-level userspace. (2) P2 doesn't close it's file descriptors. It just exits. Since removal of the file descriptor from the epoll instance isn't done during close() but during last fput() P1's epoll state remains unaffected by P2's sloppy exit because P1 still holds references to all files in its fdtable. (Sidenote, if one ends up adding every more duped-fds into epoll instance that one doesn't explicitly close and all of them refer to the same file wouldn't one just be allocating new epitems that are kept around for a really long time?) So if the removal of the fd would now be done during close() or during exit_files() when we call close_files() and since there's currently no way of differentiating whether P1 or P2 own that fd it would mean that (2) collapses into (1) and we'd always alter (1)'s epoll state. That would be a UAPI break. So say we record the fdtable to get ownership of that file descriptor so P2 doesn't close anything in (2) that really belongs to P1 to fix that problem. But afaict, that would break another possible use-case. Namely, where P1 creates an epoll instance and registeres fds and then fork()s to create P2. Now P1 can exit and P2 takes over the epoll loop of P1. This wouldn't work anymore because P1 would deregister all fds it owns in that epoll instance during exit. I didn't see an immediate nice way of fixing that issue. But note that taking over an epoll loop from the parent doesn't work reliably for some file descriptors. Consider man signalfd(2): epoll(7) semantics If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an epoll(7) instance, then epoll_wait(2) returns events only for signals sent to that process. In particular, if the process then uses fork(2) to create a child process, then the child will be able to read(2) signals that are sent to it using the signalfd file descriptor, but epoll_wait(2) will not indicate that the signalfd file descriptor is ready. In this scenario, a possible workaround is that after the fork(2), the child process can close the signalfd file descriptor that it inherited from the parent process and then create another signalfd file descriptor and add it to the epoll instance. Alternatively, the parent and the child could delay creating their (separate) signalfd file descriptors and adding them to the epoll instance until after the call to fork(2). So effectively P1 opens a signalfd and registers it in an epoll instance. Then it fork()s and creates P2. Now both P1 and P2 call epoll_wait(). Since signalfds are always relative to the caller and P1 did call signalfd_poll() to register the callback only P1 can get events. So P2 can't take over signalfds in that epoll loop. Honestly, the inheritance semantics of epoll across fork() seem pretty wonky and it would've been better if an epoll fd inherited across would've returned ESTALE or EINVAL or something. And if that inheritance of epoll instances would really be a big use-case there'd be some explicit way to enable this.
On Tue, 7 May 2024 at 12:07, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > That example thing shows that we shouldn't make it a FISAME ioctl - we > should make it a fcntl() instead, and it would just be a companion to > F_DUPFD. > > Doesn't that strike everybody as a *much* cleaner interface? I think > F_ISDUP would work very naturally indeed with F_DUPFD. So since we already have two versions of F_DUPFD (the other being F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend on that existing naming pattern, and called it F_DUPFD_QUERY instead. I'm not married to the name, so if somebody hates it, feel free to argue otherwise. But with that, the suggested patch would end up looking something like the attached (I also re-ordered the existing "F_LINUX_SPECIFIC_BASE" users, since one of them was out of numerical order). This really feels like a very natural thing, and yes, the 'same_fd()' function in systemd that Christian also pointed at could use this very naturally. Also note that I obviously haven't tested this. Because obviously this is trivially correct and cannot possibly have any bugs. Right? RIGHT? And yes, I did check - despite the odd jump in numbers, we've never had anything between F_NOTIFY (+2) and F_CANCELLK (+5). We added F_SETLEASE (+0) , F_GETLEASE (+1) and F_NOTIFY (+2) in 2.4.0-test9 (roughly October 2000, I didn't dig deeper). And then back in 2007 we suddenly jumped to F_CANCELLK (+5) in commit 9b9d2ab4154a ("locks: add lock cancel command"). I don't know why 3/4 were shunned. After that we had 22d2b35b200f ("F_DUPFD_CLOEXEC implementation") add F_DUPFD_CLOEXEC (+6). I'd have loved to put F_DUPFD_QUERY next to it, but +5 and +7 are both used. Linus
On Wed, 8 May 2024 at 09:19, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So since we already have two versions of F_DUPFD (the other being > F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend > on that existing naming pattern, and called it F_DUPFD_QUERY instead. > > I'm not married to the name, so if somebody hates it, feel free to > argue otherwise. Side note: with this patch, doing ret = fcntl(fd1, F_DUPFD_QUERY, fd2); will result in: -1 (EBADF): 'fd1' is not a valid file descriptor -1 (EINVAL): old kernel that doesn't support F_DUPFD_QUERY 0: fd2 does not refer to the same file as fd1 1: fd2 is the same 'struct file' as fd1 and it might be worth noting a couple of things here: (a) fd2 being an invalid file descriptor does not cause EBADF, it just causes "does not match". (b) we *could* use more bits for more equality IOW, it would possibly make sense to extend the 0/1 result to be - bit #0: same file pointer - bit #1: same path - bit #2: same dentry - bit #3: same inode which are all different levels of "sameness". Does anybody care? Do we want to extend on this "sameness"? I'm not convinced, but it might be a good idea to document this as a possibly future extension, ie *if* what you care about is "same file pointer", maybe you should make sure to only look at bit #0. Linus
On Wed, May 08, 2024 at 10:14:44AM -0700, Linus Torvalds wrote: > On Wed, 8 May 2024 at 09:19, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > So since we already have two versions of F_DUPFD (the other being > > F_DUPFD_CLOEXEC) I decided that the best thing to do is to just extend > > on that existing naming pattern, and called it F_DUPFD_QUERY instead. > > > > I'm not married to the name, so if somebody hates it, feel free to > > argue otherwise. > > Side note: with this patch, doing > > ret = fcntl(fd1, F_DUPFD_QUERY, fd2); > > will result in: > > -1 (EBADF): 'fd1' is not a valid file descriptor > -1 (EINVAL): old kernel that doesn't support F_DUPFD_QUERY > 0: fd2 does not refer to the same file as fd1 > 1: fd2 is the same 'struct file' as fd1 > > and it might be worth noting a couple of things here: > > (a) fd2 being an invalid file descriptor does not cause EBADF, it > just causes "does not match". > > (b) we *could* use more bits for more equality > > IOW, it would possibly make sense to extend the 0/1 result to be > > - bit #0: same file pointer > - bit #1: same path > - bit #2: same dentry > - bit #3: same inode > > which are all different levels of "sameness". Not worth it without someone explaining in detail why imho. First pass should be to try and replace kcmp() in scenarios where it's obviously not needed or overkill. I've added a CLASS(fd_raw) in a preliminary patch since we'll need that anyway which means that your comparison patch becomes even simpler imho. I've also added a selftest patch: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc ?
On Thu, 9 May 2024 at 04:39, Christian Brauner <brauner@kernel.org> wrote: > > Not worth it without someone explaining in detail why imho. First pass > should be to try and replace kcmp() in scenarios where it's obviously > not needed or overkill. Ack. > I've added a CLASS(fd_raw) in a preliminary patch since we'll need that > anyway which means that your comparison patch becomes even simpler imho. > I've also added a selftest patch: > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc LGTM. Maybe worth adding an explicit test for "open same file, but two separate opens, F_DUPFD_QUERY returns 0? Just to clarify the "it's not testing the file on the filesystem for equality, but the file pointer itself". Linus
On Thu, May 09, 2024 at 08:48:20AM -0700, Linus Torvalds wrote: > On Thu, 9 May 2024 at 04:39, Christian Brauner <brauner@kernel.org> wrote: > > > > Not worth it without someone explaining in detail why imho. First pass > > should be to try and replace kcmp() in scenarios where it's obviously > > not needed or overkill. > > Ack. > > > I've added a CLASS(fd_raw) in a preliminary patch since we'll need that > > anyway which means that your comparison patch becomes even simpler imho. > > I've also added a selftest patch: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.misc > > LGTM. > > Maybe worth adding an explicit test for "open same file, but two > separate opens, F_DUPFD_QUERY returns 0? Just to clarify the "it's not > testing the file on the filesystem for equality, but the file pointer > itself". Yep, good point. Added now.
> For the uapi issue you describe below my take would be that we should just > try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe > I'm biased from the gpu world, where we've been hammering it in that > "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid Oh, we're very much on the same page. All new file descriptor types that I've added over the years are O_CLOEXEC by default. IOW, you need to remove O_CLOEXEC explicitly (see pidfd as an example). And imho, any new fd type that's added should just be O_CLOEXEC by default.
From: Christian Brauner > Sent: 10 May 2024 11:55 > > > For the uapi issue you describe below my take would be that we should just > > try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe > > I'm biased from the gpu world, where we've been hammering it in that > > "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid > > Oh, we're very much on the same page. All new file descriptor types that > I've added over the years are O_CLOEXEC by default. IOW, you need to > remove O_CLOEXEC explicitly (see pidfd as an example). And imho, any new > fd type that's added should just be O_CLOEXEC by default. For fd a shell redirect creates you may want so be able to say 'this fd will have O_CLOEXEC set after the next exec'. Also (possibly) a flag that can't be cleared once set and that gets kept by dup() etc. But maybe that is excessive? I've certainly used: # ip netns exec ns command 3</sys/class/net in order to be able to (easily) read status for interfaces in the default namespace and a specific namespace. The would be hard if the O_CLOEXEC flag had got set by default. (Especially without a shell builtin to clear it.) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c index 882b89edc52a..bffa8083ff36 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -285,6 +285,30 @@ static inline void free_ephead(struct epitems_head *head) kmem_cache_free(ephead_cache, head); } +/* + * The ffd.file pointer may be in the process of + * being torn down due to being closed, but we + * may not have finished eventpoll_release() yet. + * + * Technically, even with the atomic_long_inc_not_zero, + * the file may have been free'd and then gotten + * re-allocated to something else (since files are + * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU). + * + * But for epoll, we don't much care. + */ +static struct file *epi_fget(const struct epitem *epi) +{ + struct file *file; + + rcu_read_lock(); + file = epi->ffd.file; + if (!atomic_long_inc_not_zero(&file->f_count)) + file = NULL; + rcu_read_unlock(); + return file; +} + static void list_file(struct file *file) { struct epitems_head *head; @@ -987,14 +1011,18 @@ static __poll_t __ep_eventpoll_poll(struct file *file, poll_table *wait, int dep static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth) { - struct file *file = epi->ffd.file; + struct file *file = epi_fget(epi); __poll_t res; + if (!file) + return 0; + pt->_key = epi->event.events; if (!is_file_epoll(file)) res = vfs_poll(file, pt); else res = __ep_eventpoll_poll(file, pt, depth); + fput(file); return res & epi->event.events; }