Message ID | 163967169723.1823006.2868573008412053995.stgit@warthog.procyon.org.uk |
---|---|
State | New |
Headers | show |
Series | fscache, cachefiles: Rewrite | expand |
On Thu, Dec 16, 2021 at 8:22 AM David Howells <dhowells@redhat.com> wrote: > > With transparent huge pages, in the future, write_begin() and write_end() > may be passed a length parameter that, in combination with the offset into > the page, exceeds the length of that page. This allows > grab_cache_page_write_begin() to better choose the size of THP to allocate. I still think this is a fundamental bug in the caller. That "explanation" is weak, and the whole concept smells like week-old fish to me. Linus
Linus Torvalds <torvalds@linux-foundation.org> wrote: > > With transparent huge pages, in the future, write_begin() and write_end() > > may be passed a length parameter that, in combination with the offset into > > the page, exceeds the length of that page. This allows > > grab_cache_page_write_begin() to better choose the size of THP to allocate. > > I still think this is a fundamental bug in the caller. That > "explanation" is weak, and the whole concept smells like week-old fish > to me. You really should ask Willy about this as it's multipage folio-related. AIUI, because the page/folio may be allocated inside ->write_begin(), generic_perform_write() tells the filesystem how much it has been asked to write and then the folio allocation can be made to fit that. However, at this time, ->write_begin() and ->write_end() have a page pointer (or pointer-to-pointer), not a folio pointer, in their signature, so the filesystem has to convert between them. I'm working on write helpers for netfslib that absorb this out of the filesystems that use it into its own take on generic_perform_write(), thereby eliminating the need for ->write_begin and ->write_end. I have this kind of working for afs and 9p at the moment and am looking at ceph, but there's a way to go yet. I believe iomap does the same for block-based filesystems that use it (such as xfs). I think Willy's aim is to get rid of ->write_begin and ->write_end entirely. David
On Thu, Dec 16, 2021 at 08:31:19AM -0800, Linus Torvalds wrote: > On Thu, Dec 16, 2021 at 8:22 AM David Howells <dhowells@redhat.com> wrote: > > > > With transparent huge pages, in the future, write_begin() and write_end() > > may be passed a length parameter that, in combination with the offset into > > the page, exceeds the length of that page. This allows > > grab_cache_page_write_begin() to better choose the size of THP to allocate. > > I still think this is a fundamental bug in the caller. That > "explanation" is weak, and the whole concept smells like week-old fish > to me. You're right that ->write_end can't be called with more bytes than fit in the folio. That makes no sense at all. I haven't finished fully fleshing this out yet (and as a result we still only create PAGE_SIZE folios on writes), but my basic plan was: generic_perform_write: - bytes = min_t(unsigned long, PAGE_SIZE - offset, + bytes = min_t(unsigned long, FOLIO_MAX_PAGE_CACHE_SIZE - offset, iov_iter_count(i)); ... status = a_ops->write_begin(file, mapping, pos, bytes, flags, - &page, &fsdata); + &folio, &fsdata); + offset = offset_in_folio(folio, pos); + bytes = folio_size(folio - offset); ... status = a_ops->write_end(file, mapping, pos, bytes, copied, - page, fsdata); + folio, fsdata); Since ->write_begin is the place where we actually create folios, it needs to know what size folio to create. Unless you'd rather we do something to actually create the folio before calling ->write_begin?
On Thu, Dec 16, 2021 at 11:28 AM Matthew Wilcox <willy@infradead.org> wrote: > > Since ->write_begin is the place where we actually create folios, it > needs to know what size folio to create. Unless you'd rather we do > something to actually create the folio before calling ->write_begin? I don't think we can create a folio before that, because the filesystem may not even want a folio (think persistent memory or whatever). Honestly, I think you need to describe more what you actually want to happen. Because generic_perform_write() has already decided to use a PAGE_SIZE by the time write_begin() is called, Right now the world order is "we chunk things by PAGE_SIZE", and that's just how it is. I can see other options - like the filesystem passing in the chunk size when it calls generic_perform_write(). Or we make the rule be that ->write_begin() simply always is given the whole area, and the filesystem can decide how it wants to chunk things up, and return the size of the write chunk in the status (rather than the current "success or error"). But at no point will this *EVER* be a "afs will limit the size to the folio size" issue. Nothing like that will ever make sense. Allowing bigger chunks will not be about any fscache issues, it will be about every single filesystem that uses generic_perform_write(). So I will NAK these patches by David, because they are fundamentally wrong, whichever way we turn. Any "write in bigger chunks" patch will be something else entirely. Linus
On Thu, Dec 16, 2021 at 11:46:18AM -0800, Linus Torvalds wrote: > On Thu, Dec 16, 2021 at 11:28 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > Since ->write_begin is the place where we actually create folios, it > > needs to know what size folio to create. Unless you'd rather we do > > something to actually create the folio before calling ->write_begin? > > I don't think we can create a folio before that, because the > filesystem may not even want a folio (think persistent memory or > whatever). > > Honestly, I think you need to describe more what you actually want to > happen. Because generic_perform_write() has already decided to use a > PAGE_SIZE by the time write_begin() is called, > > Right now the world order is "we chunk things by PAGE_SIZE", and > that's just how it is. Right. And we could leave it like that. There's a huge amount of win that comes from just creating large folios as part of readahead, and anything we do for writes is going to be a smaller win. That said, I would like it if a program which does: fd = creat("foo", 0644); write(fd, buf, 64 * 1024); close(fd); uses a single 64k page. > I can see other options - like the filesystem passing in the chunk > size when it calls generic_perform_write(). I'm hoping to avoid that. Ideally filesystems don't know what the "chunk size" is that's being used; they'll see a mixture of sizes being used for any given file (potentially). Depends on access patterns, availability of higher-order memory, etc. > Or we make the rule be that ->write_begin() simply always is given the > whole area, and the filesystem can decide how it wants to chunk things > up, and return the size of the write chunk in the status (rather than > the current "success or error"). We do need to be slightly more limiting than "always gets the whole area", because we do that fault_in_iov_iter_readable() call first, and if the user has been mean and asked to write() 2GB of memory on a (virtual) machine with 256MB, I'd prefer it if we didn't swap our way through 2GB of address space before calling into ->write_begin. > But at no point will this *EVER* be a "afs will limit the size to the > folio size" issue. Nothing like that will ever make sense. Allowing > bigger chunks will not be about any fscache issues, it will be about > every single filesystem that uses generic_perform_write(). I agree that there should be nothing here that is specific to fscache. David has in the past tried to convince me that he should always get 256kB folios, and I've done my best to explain that the MM just isn't going to make that guarantee. That said, this patch seems to be doing the right thing; it passes the entire length into netfs_write_begin(), and is then truncating the length to stop at the end of the folio that it got back.
Linus Torvalds <torvalds@linux-foundation.org> wrote: > So I will NAK these patches by David, because they are fundamentally > wrong, whichever way we turn. Any "write in bigger chunks" patch will > be something else entirely. I'll just drop patches 56 and 57 for now, then. I think the problem only manifests if I set the flag saying the filesystem/inode/whatever supports multipage folios. David
diff --git a/fs/afs/write.c b/fs/afs/write.c index ca4909baf5e6..8e4e87d66855 100644 --- a/fs/afs/write.c +++ b/fs/afs/write.c @@ -25,7 +25,8 @@ int afs_set_page_dirty(struct page *page) } /* - * prepare to perform part of a write to a page + * Prepare to perform part of a write to a page. Note that len may extend + * beyond the end of the page. */ int afs_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, @@ -36,7 +37,6 @@ int afs_write_begin(struct file *file, struct address_space *mapping, unsigned long priv; unsigned f, from; unsigned t, to; - pgoff_t index; int ret; _enter("{%llx:%llu},%llx,%x", @@ -51,8 +51,8 @@ int afs_write_begin(struct file *file, struct address_space *mapping, if (ret < 0) return ret; - index = folio_index(folio); - from = pos - index * PAGE_SIZE; + from = offset_in_folio(folio, pos); + len = min_t(size_t, len, folio_size(folio) - from); to = from + len; try_again: @@ -103,7 +103,8 @@ int afs_write_begin(struct file *file, struct address_space *mapping, } /* - * finalise part of a write to a page + * Finalise part of a write to a page. Note that len may extend beyond the end + * of the page. */ int afs_write_end(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned copied,