[RFC] Convert ceph_page_mkwrite to use a folio

Message ID	YsbzAoGAAZtlxsrd@casper.infradead.org
State	Superseded
Headers	show Return-Path: <ceph-devel-owner@kernel.org> Date: Thu, 7 Jul 2022 15:51:46 +0100 From: Matthew Wilcox <willy@infradead.org> To: ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: [RFC] Convert ceph_page_mkwrite to use a folio Message-ID: <YsbzAoGAAZtlxsrd@casper.infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Precedence: bulk
Series	[RFC] Convert ceph_page_mkwrite to use a folio \| expand [RFC] Convert ceph_page_mkwrite to use a folio

Message ID

YsbzAoGAAZtlxsrd@casper.infradead.org

State

Superseded

Headers

Date: Thu, 7 Jul 2022 15:51:46 +0100
From: Matthew Wilcox <willy@infradead.org>
To: ceph-devel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: [RFC] Convert ceph_page_mkwrite to use a folio
Message-ID: <YsbzAoGAAZtlxsrd@casper.infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Precedence: bulk

Series

[RFC] Convert ceph_page_mkwrite to use a folio | expand

Commit Message

Matthew Wilcox (Oracle) July 7, 2022, 2:51 p.m. UTC

There are some latent bugs that I fix here (eg, you can't call
thp_size() on a tail page), but the real question is how Ceph in
particular (and FS in general) want to handle mkwrite in a world
of multi-page folios.

If we have a multi-page folio which is occupying an entire PMD, then
no question, we have to mark all 2MB (or whatever) as dirty.  But
if it's being mapped with PTEs, either because it's mapped misaligned,
or it's smaller than a PMD, then we have a choice.  We can either
work in 4kB chunks, marking each one dirty (and storing the sub-folio
dirty state in the fs private data) like a write might.  Or we can
just say "Hey, the whole folio is dirty now" and not try to track
dirtiness on a per-page granularity.

The latter course seems to have been taken, modulo the bugs, but I
don't know if any thought was taken or whether it was done by rote.

Comments

Jeff Layton July 7, 2022, 5:48 p.m. UTC | #1

On Thu, 2022-07-07 at 15:51 +0100, Matthew Wilcox wrote:
> There are some latent bugs that I fix here (eg, you can't call
> thp_size() on a tail page), but the real question is how Ceph in
> particular (and FS in general) want to handle mkwrite in a world
> of multi-page folios.
> 
> If we have a multi-page folio which is occupying an entire PMD, then
> no question, we have to mark all 2MB (or whatever) as dirty.  But
> if it's being mapped with PTEs, either because it's mapped misaligned,
> or it's smaller than a PMD, then we have a choice.  We can either
> work in 4kB chunks, marking each one dirty (and storing the sub-folio
> dirty state in the fs private data) like a write might.  Or we can
> just say "Hey, the whole folio is dirty now" and not try to track
> dirtiness on a per-page granularity.
> 
> The latter course seems to have been taken, modulo the bugs, but I
> don't know if any thought was taken or whether it was done by rote.
> 

Done by rote, I'm pretty sure.

If each individual page retains its own dirty bit, what does
folio_test_dirty return when its pages are only partially dirty? I guess
the folio is still dirty even if some of its pages are clean?

Ceph can do a vectored write if a folio has disjoint dirty regions that
we need to flush. Hashing out an API to handle that with the netfs layer
is going to be "interesting" though.

> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 6dee88815491..fb346b929f65 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1503,8 +1503,8 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
>  	struct ceph_inode_info *ci = ceph_inode(inode);
>  	struct ceph_file_info *fi = vma->vm_file->private_data;
>  	struct ceph_cap_flush *prealloc_cf;
> -	struct page *page = vmf->page;
> -	loff_t off = page_offset(page);
> +	struct folio *folio = page_folio(vmf->page);
> +	loff_t pos = folio_pos(folio);
>  	loff_t size = i_size_read(inode);
>  	size_t len;
>  	int want, got, err;
> @@ -1521,50 +1521,50 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
>  	sb_start_pagefault(inode->i_sb);
>  	ceph_block_sigs(&oldset);
>  
> -	if (off + thp_size(page) <= size)
> -		len = thp_size(page);
> +	if (pos + folio_size(folio) <= size)
> +		len = folio_size(folio);
>  	else
> -		len = offset_in_thp(page, size);
> +		len = offset_in_folio(folio, size);
>  
>  	dout("page_mkwrite %p %llx.%llx %llu~%zd getting caps i_size %llu\n",
> -	     inode, ceph_vinop(inode), off, len, size);
> +	     inode, ceph_vinop(inode), pos, len, size);
>  	if (fi->fmode & CEPH_FILE_MODE_LAZY)
>  		want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
>  	else
>  		want = CEPH_CAP_FILE_BUFFER;
>  
>  	got = 0;
> -	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
> +	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, pos + len, &got);
>  	if (err < 0)
>  		goto out_free;
>  
>  	dout("page_mkwrite %p %llu~%zd got cap refs on %s\n",
> -	     inode, off, len, ceph_cap_string(got));
> +	     inode, pos, len, ceph_cap_string(got));
>  
> -	/* Update time before taking page lock */
> +	/* Update time before taking folio lock */
>  	file_update_time(vma->vm_file);
>  	inode_inc_iversion_raw(inode);
>  
>  	do {
>  		struct ceph_snap_context *snapc;
>  
> -		lock_page(page);
> +		folio_lock(folio);
>  
> -		if (page_mkwrite_check_truncate(page, inode) < 0) {
> -			unlock_page(page);
> +		if (folio_mkwrite_check_truncate(folio, inode) < 0) {
> +			folio_unlock(folio);
>  			ret = VM_FAULT_NOPAGE;
>  			break;
>  		}
>  
> -		snapc = ceph_find_incompatible(page);
> +		snapc = ceph_find_incompatible(&folio->page);
>  		if (!snapc) {
> -			/* success.  we'll keep the page locked. */
> -			set_page_dirty(page);
> +			/* success.  we'll keep the folio locked. */
> +			folio_mark_dirty(folio);
>  			ret = VM_FAULT_LOCKED;
>  			break;
>  		}
>  
> -		unlock_page(page);
> +		folio_unlock(folio);
>  
>  		if (IS_ERR(snapc)) {
>  			ret = VM_FAULT_SIGBUS;
> @@ -1588,7 +1588,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
>  	}
>  
>  	dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %x\n",
> -	     inode, off, len, ceph_cap_string(got), ret);
> +	     inode, pos, len, ceph_cap_string(got), ret);
>  	ceph_put_cap_refs_async(ci, got);
>  out_free:
>  	ceph_restore_sigs(&oldset);

Matthew Wilcox (Oracle) July 7, 2022, 6:13 p.m. UTC | #2

On Thu, Jul 07, 2022 at 01:48:48PM -0400, Jeff Layton wrote:
> On Thu, 2022-07-07 at 15:51 +0100, Matthew Wilcox wrote:
> > There are some latent bugs that I fix here (eg, you can't call
> > thp_size() on a tail page), but the real question is how Ceph in
> > particular (and FS in general) want to handle mkwrite in a world
> > of multi-page folios.
> > 
> > If we have a multi-page folio which is occupying an entire PMD, then
> > no question, we have to mark all 2MB (or whatever) as dirty.  But
> > if it's being mapped with PTEs, either because it's mapped misaligned,
> > or it's smaller than a PMD, then we have a choice.  We can either
> > work in 4kB chunks, marking each one dirty (and storing the sub-folio
> > dirty state in the fs private data) like a write might.  Or we can
> > just say "Hey, the whole folio is dirty now" and not try to track
> > dirtiness on a per-page granularity.
> > 
> > The latter course seems to have been taken, modulo the bugs, but I
> > don't know if any thought was taken or whether it was done by rote.
> > 
> 
> Done by rote, I'm pretty sure.
> 
> If each individual page retains its own dirty bit, what does
> folio_test_dirty return when its pages are only partially dirty? I guess
> the folio is still dirty even if some of its pages are clean?

ah, each *PTE* retains its own dirty bit.  So we can track which pages
within the folio are actually dirty.  But the folio as a whole has a
single dirty bit (filesystems, of course, are welcome to keep track of
sub-folio dirtiness at whatever granularity makes sense for them).

> Ceph can do a vectored write if a folio has disjoint dirty regions that
> we need to flush. Hashing out an API to handle that with the netfs layer
> is going to be "interesting" though.

Yes; I'm sure Dave has ideas about that ...

> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 6dee88815491..fb346b929f65 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1503,8 +1503,8 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> >  	struct ceph_inode_info *ci = ceph_inode(inode);
> >  	struct ceph_file_info *fi = vma->vm_file->private_data;
> >  	struct ceph_cap_flush *prealloc_cf;
> > -	struct page *page = vmf->page;
> > -	loff_t off = page_offset(page);
> > +	struct folio *folio = page_folio(vmf->page);
> > +	loff_t pos = folio_pos(folio);
> >  	loff_t size = i_size_read(inode);
> >  	size_t len;
> >  	int want, got, err;
> > @@ -1521,50 +1521,50 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> >  	sb_start_pagefault(inode->i_sb);
> >  	ceph_block_sigs(&oldset);
> >  
> > -	if (off + thp_size(page) <= size)
> > -		len = thp_size(page);
> > +	if (pos + folio_size(folio) <= size)
> > +		len = folio_size(folio);
> >  	else
> > -		len = offset_in_thp(page, size);
> > +		len = offset_in_folio(folio, size);
> >  
> >  	dout("page_mkwrite %p %llx.%llx %llu~%zd getting caps i_size %llu\n",
> > -	     inode, ceph_vinop(inode), off, len, size);
> > +	     inode, ceph_vinop(inode), pos, len, size);
> >  	if (fi->fmode & CEPH_FILE_MODE_LAZY)
> >  		want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
> >  	else
> >  		want = CEPH_CAP_FILE_BUFFER;
> >  
> >  	got = 0;
> > -	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
> > +	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, pos + len, &got);
> >  	if (err < 0)
> >  		goto out_free;
> >  
> >  	dout("page_mkwrite %p %llu~%zd got cap refs on %s\n",
> > -	     inode, off, len, ceph_cap_string(got));
> > +	     inode, pos, len, ceph_cap_string(got));
> >  
> > -	/* Update time before taking page lock */
> > +	/* Update time before taking folio lock */
> >  	file_update_time(vma->vm_file);
> >  	inode_inc_iversion_raw(inode);
> >  
> >  	do {
> >  		struct ceph_snap_context *snapc;
> >  
> > -		lock_page(page);
> > +		folio_lock(folio);
> >  
> > -		if (page_mkwrite_check_truncate(page, inode) < 0) {
> > -			unlock_page(page);
> > +		if (folio_mkwrite_check_truncate(folio, inode) < 0) {
> > +			folio_unlock(folio);
> >  			ret = VM_FAULT_NOPAGE;
> >  			break;
> >  		}
> >  
> > -		snapc = ceph_find_incompatible(page);
> > +		snapc = ceph_find_incompatible(&folio->page);
> >  		if (!snapc) {
> > -			/* success.  we'll keep the page locked. */
> > -			set_page_dirty(page);
> > +			/* success.  we'll keep the folio locked. */
> > +			folio_mark_dirty(folio);
> >  			ret = VM_FAULT_LOCKED;
> >  			break;
> >  		}
> >  
> > -		unlock_page(page);
> > +		folio_unlock(folio);
> >  
> >  		if (IS_ERR(snapc)) {
> >  			ret = VM_FAULT_SIGBUS;
> > @@ -1588,7 +1588,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
> >  	}
> >  
> >  	dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %x\n",
> > -	     inode, off, len, ceph_cap_string(got), ret);
> > +	     inode, pos, len, ceph_cap_string(got), ret);
> >  	ceph_put_cap_refs_async(ci, got);
> >  out_free:
> >  	ceph_restore_sigs(&oldset);
> 
> -- 
> Jeff Layton <jlayton@kernel.org>

Xiubo Li July 8, 2022, 2:10 a.m. UTC | #3

On 7/8/22 1:48 AM, Jeff Layton wrote:
> On Thu, 2022-07-07 at 15:51 +0100, Matthew Wilcox wrote:
>> There are some latent bugs that I fix here (eg, you can't call
>> thp_size() on a tail page), but the real question is how Ceph in
>> particular (and FS in general) want to handle mkwrite in a world
>> of multi-page folios.
>>
>> If we have a multi-page folio which is occupying an entire PMD, then
>> no question, we have to mark all 2MB (or whatever) as dirty.  But
>> if it's being mapped with PTEs, either because it's mapped misaligned,
>> or it's smaller than a PMD, then we have a choice.  We can either
>> work in 4kB chunks, marking each one dirty (and storing the sub-folio
>> dirty state in the fs private data) like a write might.  Or we can
>> just say "Hey, the whole folio is dirty now" and not try to track
>> dirtiness on a per-page granularity.
>>
>> The latter course seems to have been taken, modulo the bugs, but I
>> don't know if any thought was taken or whether it was done by rote.
>>
> Done by rote, I'm pretty sure.
>
> If each individual page retains its own dirty bit, what does
> folio_test_dirty return when its pages are only partially dirty? I guess
> the folio is still dirty even if some of its pages are clean?
>
> Ceph can do a vectored write if a folio has disjoint dirty regions that
> we need to flush. Hashing out an API to handle that with the netfs layer
> is going to be "interesting" though.

Yeah, sounds reasonable to me.


>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index 6dee88815491..fb346b929f65 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -1503,8 +1503,8 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
>>   	struct ceph_inode_info *ci = ceph_inode(inode);
>>   	struct ceph_file_info *fi = vma->vm_file->private_data;
>>   	struct ceph_cap_flush *prealloc_cf;
>> -	struct page *page = vmf->page;
>> -	loff_t off = page_offset(page);
>> +	struct folio *folio = page_folio(vmf->page);
>> +	loff_t pos = folio_pos(folio);
>>   	loff_t size = i_size_read(inode);
>>   	size_t len;
>>   	int want, got, err;
>> @@ -1521,50 +1521,50 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
>>   	sb_start_pagefault(inode->i_sb);
>>   	ceph_block_sigs(&oldset);
>>   
>> -	if (off + thp_size(page) <= size)
>> -		len = thp_size(page);
>> +	if (pos + folio_size(folio) <= size)
>> +		len = folio_size(folio);
>>   	else
>> -		len = offset_in_thp(page, size);
>> +		len = offset_in_folio(folio, size);
>>   
>>   	dout("page_mkwrite %p %llx.%llx %llu~%zd getting caps i_size %llu\n",
>> -	     inode, ceph_vinop(inode), off, len, size);
>> +	     inode, ceph_vinop(inode), pos, len, size);
>>   	if (fi->fmode & CEPH_FILE_MODE_LAZY)
>>   		want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
>>   	else
>>   		want = CEPH_CAP_FILE_BUFFER;
>>   
>>   	got = 0;
>> -	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
>> +	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, pos + len, &got);
>>   	if (err < 0)
>>   		goto out_free;
>>   
>>   	dout("page_mkwrite %p %llu~%zd got cap refs on %s\n",
>> -	     inode, off, len, ceph_cap_string(got));
>> +	     inode, pos, len, ceph_cap_string(got));
>>   
>> -	/* Update time before taking page lock */
>> +	/* Update time before taking folio lock */
>>   	file_update_time(vma->vm_file);
>>   	inode_inc_iversion_raw(inode);
>>   
>>   	do {
>>   		struct ceph_snap_context *snapc;
>>   
>> -		lock_page(page);
>> +		folio_lock(folio);
>>   
>> -		if (page_mkwrite_check_truncate(page, inode) < 0) {
>> -			unlock_page(page);
>> +		if (folio_mkwrite_check_truncate(folio, inode) < 0) {
>> +			folio_unlock(folio);
>>   			ret = VM_FAULT_NOPAGE;
>>   			break;
>>   		}
>>   
>> -		snapc = ceph_find_incompatible(page);
>> +		snapc = ceph_find_incompatible(&folio->page);
>>   		if (!snapc) {
>> -			/* success.  we'll keep the page locked. */
>> -			set_page_dirty(page);
>> +			/* success.  we'll keep the folio locked. */
>> +			folio_mark_dirty(folio);
>>   			ret = VM_FAULT_LOCKED;
>>   			break;
>>   		}
>>   
>> -		unlock_page(page);
>> +		folio_unlock(folio);
>>   
>>   		if (IS_ERR(snapc)) {
>>   			ret = VM_FAULT_SIGBUS;
>> @@ -1588,7 +1588,7 @@ static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
>>   	}
>>   
>>   	dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %x\n",
>> -	     inode, off, len, ceph_cap_string(got), ret);
>> +	     inode, pos, len, ceph_cap_string(got), ret);
>>   	ceph_put_cap_refs_async(ci, got);
>>   out_free:
>>   	ceph_restore_sigs(&oldset);

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6dee88815491..fb346b929f65 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1503,8 +1503,8 @@  static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_file_info *fi = vma->vm_file->private_data;
 	struct ceph_cap_flush *prealloc_cf;
-	struct page *page = vmf->page;
-	loff_t off = page_offset(page);
+	struct folio *folio = page_folio(vmf->page);
+	loff_t pos = folio_pos(folio);
 	loff_t size = i_size_read(inode);
 	size_t len;
 	int want, got, err;
@@ -1521,50 +1521,50 @@  static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
 	sb_start_pagefault(inode->i_sb);
 	ceph_block_sigs(&oldset);
 
-	if (off + thp_size(page) <= size)
-		len = thp_size(page);
+	if (pos + folio_size(folio) <= size)
+		len = folio_size(folio);
 	else
-		len = offset_in_thp(page, size);
+		len = offset_in_folio(folio, size);
 
 	dout("page_mkwrite %p %llx.%llx %llu~%zd getting caps i_size %llu\n",
-	     inode, ceph_vinop(inode), off, len, size);
+	     inode, ceph_vinop(inode), pos, len, size);
 	if (fi->fmode & CEPH_FILE_MODE_LAZY)
 		want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
 	else
 		want = CEPH_CAP_FILE_BUFFER;
 
 	got = 0;
-	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, off + len, &got);
+	err = ceph_get_caps(vma->vm_file, CEPH_CAP_FILE_WR, want, pos + len, &got);
 	if (err < 0)
 		goto out_free;
 
 	dout("page_mkwrite %p %llu~%zd got cap refs on %s\n",
-	     inode, off, len, ceph_cap_string(got));
+	     inode, pos, len, ceph_cap_string(got));
 
-	/* Update time before taking page lock */
+	/* Update time before taking folio lock */
 	file_update_time(vma->vm_file);
 	inode_inc_iversion_raw(inode);
 
 	do {
 		struct ceph_snap_context *snapc;
 
-		lock_page(page);
+		folio_lock(folio);
 
-		if (page_mkwrite_check_truncate(page, inode) < 0) {
-			unlock_page(page);
+		if (folio_mkwrite_check_truncate(folio, inode) < 0) {
+			folio_unlock(folio);
 			ret = VM_FAULT_NOPAGE;
 			break;
 		}
 
-		snapc = ceph_find_incompatible(page);
+		snapc = ceph_find_incompatible(&folio->page);
 		if (!snapc) {
-			/* success.  we'll keep the page locked. */
-			set_page_dirty(page);
+			/* success.  we'll keep the folio locked. */
+			folio_mark_dirty(folio);
 			ret = VM_FAULT_LOCKED;
 			break;
 		}
 
-		unlock_page(page);
+		folio_unlock(folio);
 
 		if (IS_ERR(snapc)) {
 			ret = VM_FAULT_SIGBUS;
@@ -1588,7 +1588,7 @@  static vm_fault_t ceph_page_mkwrite(struct vm_fault *vmf)
 	}
 
 	dout("page_mkwrite %p %llu~%zd dropping cap refs on %s ret %x\n",
-	     inode, off, len, ceph_cap_string(got), ret);
+	     inode, pos, len, ceph_cap_string(got), ret);
 	ceph_put_cap_refs_async(ci, got);
 out_free:
 	ceph_restore_sigs(&oldset);

[RFC] Convert ceph_page_mkwrite to use a folio

Commit Message

Comments

Patch