Message ID: 20250523161522.409504-4-mhklinux@outlook.com
State: New
Series: fbdev: Add deferred I/O support for contiguous kernel memory framebuffers
Hi,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.15-rc7 next-20250523]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/mhkelley58-gmail-com/mm-Export-vmf_insert_mixed_mkwrite/20250524-001707
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250523161522.409504-4-mhklinux%40outlook.com
patch subject: [PATCH v3 3/4] fbdev/deferred-io: Support contiguous kernel memory framebuffers
config: arm-randconfig-001-20250524 (https://download.01.org/0day-ci/archive/20250524/202505241553.VSXoFOvX-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250524/202505241553.VSXoFOvX-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version
of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505241553.VSXoFOvX-lkp@intel.com/

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "vmf_insert_pfn" [drivers/video/fbdev/core/fb.ko] undefined!
>> ERROR: modpost: "vmf_insert_mixed_mkwrite" [drivers/video/fbdev/core/fb.ko] undefined!
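The two modpost failures mean that fb.ko, when built as a module, references symbols that mm/memory.c defines but does not export in this tree. A minimal sketch of the kind of change that resolves such errors (hedged: whether to use EXPORT_SYMBOL or EXPORT_SYMBOL_GPL, and whether exporting is acceptable at all, is for the mm maintainers to decide; the first patch of this series exists precisely to export vmf_insert_mixed_mkwrite):

```c
/* mm/memory.c -- sketch only: add an export next to each definition so
 * that modular users such as drivers/video/fbdev/core/fb.ko can link.
 */
EXPORT_SYMBOL_GPL(vmf_insert_pfn);
EXPORT_SYMBOL_GPL(vmf_insert_mixed_mkwrite);
```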
On Fri, May 23, 2025 at 09:15:21AM -0700, mhkelley58@gmail.com wrote:
> Commit 37b4837959cb ("video: deferred io with physically contiguous
> memory") from the year 2008 purported to add support for contiguous
> kernel memory framebuffers. The motivating device, sh_mobile_lcdcfb, uses
> dma_alloc_coherent() to allocate framebuffer memory, which is likely to
> use alloc_pages(). It's unclear to me how this commit actually worked at
> the time, unless dma_alloc_coherent() was pulling from a CMA pool instead
> of alloc_pages(). Or perhaps alloc_pages() worked differently on the
> arm32 architecture on which sh_mobile_lcdcfb is used.
>
> In any case, for x86 and arm64 today, commit 37b4837959cb is not
> sufficient to support contiguous kernel memory framebuffers. The problem
> can be seen with the hyperv_fb driver, which may allocate the framebuffer
> memory using vmalloc() or alloc_pages(), depending on the configuration
> of the Hyper-V guest VM (Gen 1 vs. Gen 2) and the size of the framebuffer.

That language is far too nice. The existing users of fb_defio are all
gravely broken because they violate the DMA API restriction not to poke
into the memory. You can't speculate about what you get from
dma_alloc_coherent(), and it can change behind you all the time.

> Fix this limitation by adding defio support for contiguous kernel memory
> framebuffers. A driver with a framebuffer allocated from contiguous
> kernel memory must set the FBINFO_KMEMFB flag to indicate such.

Honestly, the right thing is to invert the flag. What hyperv is doing
here, taking kernel memory in the direct mapping or from vmalloc, is
fine. What other drivers do is completely broken crap. So add a flag
FBINFO_BROKEN_CRAP to maybe keep the guessing. Or just disable it
because it's dangerous.
On 23.05.25 18:15, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
>
> Current defio code works only for framebuffer memory that is allocated
> with vmalloc(). The code assumes that the underlying page refcount can
> be used by the mm subsystem to manage each framebuffer page's lifecycle,
> including freeing the page if the refcount goes to 0. This approach is
> consistent with vmalloc'ed memory, but not with contiguous kernel memory
> allocated via alloc_pages() or similar. The latter such memory pages
> usually have a refcount of 0 when allocated, and would be incorrectly
> freed page-by-page if used with defio. That freeing corrupts the memory
> free lists and Linux eventually panics. Simply bumping the refcount after
> allocation doesn't work because when the framebuffer memory is freed,
> __free_pages() complains about non-zero refcounts.
>
> Commit 37b4837959cb ("video: deferred io with physically contiguous
> memory") from the year 2008 purported to add support for contiguous
> kernel memory framebuffers. The motivating device, sh_mobile_lcdcfb, uses
> dma_alloc_coherent() to allocate framebuffer memory, which is likely to
> use alloc_pages(). It's unclear to me how this commit actually worked at
> the time, unless dma_alloc_coherent() was pulling from a CMA pool instead
> of alloc_pages(). Or perhaps alloc_pages() worked differently on the
> arm32 architecture on which sh_mobile_lcdcfb is used.
>
> In any case, for x86 and arm64 today, commit 37b4837959cb is not
> sufficient to support contiguous kernel memory framebuffers. The problem
> can be seen with the hyperv_fb driver, which may allocate the framebuffer
> memory using vmalloc() or alloc_pages(), depending on the configuration
> of the Hyper-V guest VM (Gen 1 vs. Gen 2) and the size of the framebuffer.
>
> Fix this limitation by adding defio support for contiguous kernel memory
> framebuffers. A driver with a framebuffer allocated from contiguous
> kernel memory must set the FBINFO_KMEMFB flag to indicate such.
>
> Tested with the hyperv_fb driver in both configurations -- with a vmalloc()
> framebuffer and with an alloc_pages() framebuffer on x86. Also verified a
> vmalloc() framebuffer on arm64. Hardware is not available to me to verify
> that the older arm32 devices still work correctly, but the path for
> vmalloc() framebuffers is essentially unchanged.
>
> Even with these changes, defio does not support framebuffers in MMIO
> space, as defio code depends on framebuffer memory pages having
> corresponding 'struct page's.
>
> Fixes: 3a6fb6c4255c ("video: hyperv: hyperv_fb: Use physical memory for fb on HyperV Gen 1 VMs.")
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
> Changes in v3:
> * Moved definition of FBINFO_KMEMFB flag to a separate patch
>   preceding this one in the patch set [Helge Deller]
> Changes in v2:
> * Tweaked code comments regarding framebuffers allocated with
>   dma_alloc_coherent() [Christoph Hellwig]
>
>  drivers/video/fbdev/core/fb_defio.c | 128 +++++++++++++++++++++++-----
>  1 file changed, 108 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
> index 4fc93f253e06..f8ae91a1c4df 100644
> --- a/drivers/video/fbdev/core/fb_defio.c
> +++ b/drivers/video/fbdev/core/fb_defio.c
> @@ -8,11 +8,40 @@
>   * for more details.
>   */
>
> +/*
> + * Deferred I/O ("defio") allows framebuffers that are mmap()'ed to user space
> + * to batch user space writes into periodic updates to the underlying
> + * framebuffer hardware or other implementation (such as with a virtualized
> + * framebuffer in a VM). At each batch interval, a callback is invoked in the
> + * framebuffer's kernel driver, and the callback is supplied with a list of
> + * pages that have been modified in the preceding interval. The callback can
> + * use this information to update the framebuffer hardware as necessary. The
> + * batching can improve performance and reduce the overhead of updating the
> + * hardware.
> + *
> + * Defio is supported on framebuffers allocated using vmalloc() and allocated
> + * as contiguous kernel memory using alloc_pages() or kmalloc(). These
> + * memory allocations all have corresponding "struct page"s. Framebuffers
> + * allocated using dma_alloc_coherent() should not be used with defio.
> + * Such allocations should be treated as a black box owned by the DMA
> + * layer, and should not be deconstructed into individual pages as defio
> + * does. Framebuffers in MMIO space are *not* supported because MMIO space
> + * does not have corresponding "struct page"s.
> + *
> + * For framebuffers allocated using vmalloc(), struct fb_info must have
> + * "screen_buffer" set to the vmalloc address of the framebuffer. For
> + * framebuffers allocated from contiguous kernel memory, FBINFO_KMEMFB must
> + * be set, and "fix.smem_start" must be set to the physical address of the
> + * frame buffer. In both cases, "fix.smem_len" must be set to the framebuffer
> + * size in bytes.
> + */
> +
>  #include <linux/module.h>
>  #include <linux/kernel.h>
>  #include <linux/errno.h>
>  #include <linux/string.h>
>  #include <linux/mm.h>
> +#include <linux/pfn_t.h>
>  #include <linux/vmalloc.h>
>  #include <linux/delay.h>
>  #include <linux/interrupt.h>
> @@ -37,7 +66,7 @@ static struct page *fb_deferred_io_get_page(struct fb_info *info, unsigned long
>  	else if (info->fix.smem_start)
>  		page = pfn_to_page((info->fix.smem_start + offs) >> PAGE_SHIFT);
>
> -	if (page)
> +	if (page && !(info->flags & FBINFO_KMEMFB))
>  		get_page(page);
>
>  	return page;
> @@ -137,6 +166,15 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf)
>
>  	BUG_ON(!info->fbdefio->mapping);
>
> +	if (info->flags & FBINFO_KMEMFB)
> +		/*
> +		 * In this path, the VMA is marked VM_PFNMAP, so mm assumes
> +		 * there is no struct page associated with the page. The
> +		 * PFN must be directly inserted and the created PTE will be
> +		 * marked "special".
> +		 */
> +		return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(page));
> +
>  	vmf->page = page;
>  	return 0;
>  }
> @@ -163,13 +201,14 @@ EXPORT_SYMBOL_GPL(fb_deferred_io_fsync);
>
>  /*
>   * Adds a page to the dirty list. Call this from struct
> - * vm_operations_struct.page_mkwrite.
> + * vm_operations_struct.page_mkwrite or .pfn_mkwrite.
>   */
> -static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long offset,
> +static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, struct vm_fault *vmf,
>  					    struct page *page)
>  {
>  	struct fb_deferred_io *fbdefio = info->fbdefio;
>  	struct fb_deferred_io_pageref *pageref;
> +	unsigned long offset = vmf->pgoff << PAGE_SHIFT;
>  	vm_fault_t ret;
>
>  	/* protect against the workqueue changing the page list */
> @@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
>  	}
>
>  	/*
> -	 * We want the page to remain locked from ->page_mkwrite until
> -	 * the PTE is marked dirty to avoid mapping_wrprotect_range()
> -	 * being called before the PTE is updated, which would leave
> -	 * the page ignored by defio.
> -	 * Do this by locking the page here and informing the caller
> -	 * about it with VM_FAULT_LOCKED.
> +	 * The PTE must be marked writable before the defio deferred work runs
> +	 * again and potentially marks the PTE write-protected. If the order
> +	 * should be switched, the PTE would become writable without defio
> +	 * tracking the page, leaving the page forever ignored by defio.
> +	 *
> +	 * For vmalloc() framebuffers, the associated struct page is locked
> +	 * before releasing the defio lock. mm will later mark the PTE writable
> +	 * and release the struct page lock. The struct page lock prevents
> +	 * the page from being prematurely marked write-protected.
> +	 *
> +	 * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page,
> +	 * so the PTE must be marked writable while the defio lock is held.
>  	 */
> -	lock_page(pageref->page);
> +	if (info->flags & FBINFO_KMEMFB) {
> +		unsigned long pfn = page_to_pfn(pageref->page);
> +
> +		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
> +					       __pfn_to_pfn_t(pfn, PFN_SPECIAL));

Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
horrible hack.

In another thread, you mention that you use PFN_SPECIAL to bypass the
check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?

It's all a mess ...
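To make the VM_PFNMAP discussion concrete: for the FBINFO_KMEMFB case the defio mmap path would have to mark the VMA as a PFN map so that faults insert the PFN directly. This is only a sketch inferred from the quoted patch and review comments, not code from the patch itself; `fb_deferred_io_vm_ops` is the existing defio vm_operations_struct, and `vm_flags_set()` is assumed to be available in the target tree:

```c
static int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma)
{
	vma->vm_ops = &fb_deferred_io_vm_ops;
	vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
	if (info->flags & FBINFO_KMEMFB)
		/*
		 * PFN map: mm will not assume struct pages exist; the fault
		 * handler inserts the PFN directly and PTEs are "special".
		 */
		vm_flags_set(vma, VM_PFNMAP);
	vma->vm_private_data = info;
	return 0;
}
```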
From: David Hildenbrand <david@redhat.com> Sent: Monday, June 2, 2025 2:48 AM
>
> On 23.05.25 18:15, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> > [...]
> > +	if (info->flags & FBINFO_KMEMFB) {
> > +		unsigned long pfn = page_to_pfn(pageref->page);
> > +
> > +		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
> > +					       __pfn_to_pfn_t(pfn, PFN_SPECIAL));
>
> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
> horrible hack.
>
> In another thread, you mention that you use PFN_SPECIAL to bypass the
> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?

The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like
VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
a wrong impression. vm_mixed_ok() does a thorough job of validating the
use of __vm_insert_mixed(), and since what I did was allowed, I thought
perhaps it was OK. Your feedback has set me straight, and that's what I
needed. :-)

But the whole approach is moot with Alistair Popple's patch set that
eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
it. If there's not one, I'll take a crack at adding it in the next
version of my patch set.

Michael
Hi

Am 03.06.25 um 03:49 schrieb Michael Kelley:
[...]
>> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
>> horrible hack.
>>
>> In another thread, you mention that you use PFN_SPECIAL to bypass the
>> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
> The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like
> VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
> a wrong impression. vm_mixed_ok() does a thorough job of validating the
> use of __vm_insert_mixed(), and since what I did was allowed, I thought
> perhaps it was OK. Your feedback has set me straight, and that's what I
> needed. :-)
>
> But the whole approach is moot with Alistair Popple's patch set that
> eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
> special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
> it. If there's not one, I'll take a crack at adding it in the next
> version of my patch set.

What is the motivation behind this work? The driver, or fbdev as a whole,
does not have much of a future anyway. I'd like to suggest removing
hyperv_fb entirely in favor of hyperv_drm?

Best regards
Thomas

> Michael
On 03.06.25 03:49, Michael Kelley wrote:
> From: David Hildenbrand <david@redhat.com> Sent: Monday, June 2, 2025 2:48 AM
>>
>> On 23.05.25 18:15, mhkelley58@gmail.com wrote:
>>> From: Michael Kelley <mhklinux@outlook.com>
>>> [...]
>>> +	if (info->flags & FBINFO_KMEMFB) {
>>> +		unsigned long pfn = page_to_pfn(pageref->page);
>>> +
>>> +		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
>>> +					       __pfn_to_pfn_t(pfn, PFN_SPECIAL));
>>
>> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a
>> horrible hack.
>>
>> In another thread, you mention that you use PFN_SPECIAL to bypass the
>> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set?
>
> The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like
> VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's
> a wrong impression.

VM_PFNMAP: nothing is refcounted except anon pages
VM_MIXEDMAP: anything with a "struct page" (pfn_valid()) is refcounted

pte_special() is a way for GUP-fast to distinguish these refcounted
(can GUP) from non-refcounted (cannot GUP) pages mapped by PTEs without
any locks or the VMA being available.

Setting pte_special() in VM_MIXEDMAP on PTEs that have a "struct page"
(pfn_valid()) is likely very bogus.

> vm_mixed_ok() does a thorough job of validating the
> use of __vm_insert_mixed(), and since what I did was allowed, I thought
> perhaps it was OK. Your feedback has set me straight, and that's what I
> needed. :-)

What exactly are you trying to achieve? :)

If it's mapping a page with a "struct page" and *not* refcounting it,
then vmf_insert_pfn() is the current way to achieve that in a VM_PFNMAP
mapping. It will set pte_special() automatically for you.

> But the whole approach is moot with Alistair Popple's patch set that
> eliminates pfn_t. Is there an existing mm API that will do mkwrite on a
> special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed
> it. If there's not one, I'll take a crack at adding it in the next
> version of my patch set.

I assume you'd want vmf_insert_pfn_mkwrite(), correct?

Probably vmf_insert_pfn_prot() can be used by adding PAGE_WRITE to
pgprot. (maybe :) )
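A rough sketch of the vmf_insert_pfn_prot() suggestion as a .pfn_mkwrite handler. Everything here is an assumption for illustration: the handler name is hypothetical, the defio dirty-list bookkeeping from the patch is omitted, and since there is no PAGE_WRITE pgprot constant as such, a writable protection is derived via vm_get_page_prot() with VM_WRITE:

```c
static vm_fault_t fb_deferred_io_pfn_mkwrite(struct vm_fault *vmf)
{
	struct fb_info *info = vmf->vma->vm_private_data;
	unsigned long pfn = (info->fix.smem_start >> PAGE_SHIFT) + vmf->pgoff;
	/* derive a writable pgprot rather than using a PAGE_WRITE constant */
	pgprot_t prot = vm_get_page_prot(vmf->vma->vm_flags | VM_WRITE);

	/* re-insert the PFN writable; sets pte_special() in a VM_PFNMAP VMA */
	return vmf_insert_pfn_prot(vmf->vma, vmf->address, pfn, prot);
}
```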
From: David Hildenbrand <david@redhat.com> Sent: Tuesday, June 3, 2025 12:55 AM > > On 03.06.25 03:49, Michael Kelley wrote: > > From: David Hildenbrand <david@redhat.com> Sent: Monday, June 2, 2025 2:48 AM > >> > >> On 23.05.25 18:15, mhkelley58@gmail.com wrote: > >>> From: Michael Kelley <mhklinux@outlook.com> > >>> > >>> Current defio code works only for framebuffer memory that is allocated > >>> with vmalloc(). The code assumes that the underlying page refcount can > >>> be used by the mm subsystem to manage each framebuffer page's lifecycle, > >>> including freeing the page if the refcount goes to 0. This approach is > >>> consistent with vmalloc'ed memory, but not with contiguous kernel memory > >>> allocated via alloc_pages() or similar. The latter such memory pages > >>> usually have a refcount of 0 when allocated, and would be incorrectly > >>> freed page-by-page if used with defio. That free'ing corrupts the memory > >>> free lists and Linux eventually panics. Simply bumping the refcount after > >>> allocation doesn’t work because when the framebuffer memory is freed, > >>> __free_pages() complains about non-zero refcounts. > >>> > >>> Commit 37b4837959cb ("video: deferred io with physically contiguous > >>> memory") from the year 2008 purported to add support for contiguous > >>> kernel memory framebuffers. The motivating device, sh_mobile_lcdcfb, uses > >>> dma_alloc_coherent() to allocate framebuffer memory, which is likely to > >>> use alloc_pages(). It's unclear to me how this commit actually worked at > >>> the time, unless dma_alloc_coherent() was pulling from a CMA pool instead > >>> of alloc_pages(). Or perhaps alloc_pages() worked differently or on the > >>> arm32 architecture on which sh_mobile_lcdcfb is used. > >>> > >>> In any case, for x86 and arm64 today, commit 37b4837959cb9 is not > >>> sufficient to support contiguous kernel memory framebuffers. 
The problem > >>> can be seen with the hyperv_fb driver, which may allocate the framebuffer > >>> memory using vmalloc() or alloc_pages(), depending on the configuration > >>> of the Hyper-V guest VM (Gen 1 vs. Gen 2) and the size of the framebuffer. > >>> > >>> Fix this limitation by adding defio support for contiguous kernel memory > >>> framebuffers. A driver with a framebuffer allocated from contiguous > >>> kernel memory must set the FBINFO_KMEMFB flag to indicate such. > >>> > >>> Tested with the hyperv_fb driver in both configurations -- with a vmalloc() > >>> framebuffer and with an alloc_pages() framebuffer on x86. Also verified a > >>> vmalloc() framebuffer on arm64. Hardware is not available to me to verify > >>> that the older arm32 devices still work correctly, but the path for > >>> vmalloc() framebuffers is essentially unchanged. > >>> > >>> Even with these changes, defio does not support framebuffers in MMIO > >>> space, as defio code depends on framebuffer memory pages having > >>> corresponding 'struct page's. > >>> > >>> Fixes: 3a6fb6c4255c ("video: hyperv: hyperv_fb: Use physical memory for fb on HyperV > Gen 1 VMs.") > >>> Signed-off-by: Michael Kelley <mhklinux@outlook.com> > >>> --- > >>> Changes in v3: > >>> * Moved definition of FBINFO_KMEMFB flag to a separate patch > >>> preceding this one in the patch set [Helge Deller] > >>> Changes in v2: > >>> * Tweaked code comments regarding framebuffers allocated with > >>> dma_alloc_coherent() [Christoph Hellwig] > >>> > >>> drivers/video/fbdev/core/fb_defio.c | 128 +++++++++++++++++++++++----- > >>> 1 file changed, 108 insertions(+), 20 deletions(-) > >>> > >>> diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c > >>> index 4fc93f253e06..f8ae91a1c4df 100644 > >>> --- a/drivers/video/fbdev/core/fb_defio.c > >>> +++ b/drivers/video/fbdev/core/fb_defio.c > >>> @@ -8,11 +8,40 @@ > >>> * for more details.
> >>> */ > >>> > >>> +/* > >>> + * Deferred I/O ("defio") allows framebuffers that are mmap()'ed to user space > >>> + * to batch user space writes into periodic updates to the underlying > >>> + * framebuffer hardware or other implementation (such as with a virtualized > >>> + * framebuffer in a VM). At each batch interval, a callback is invoked in the > >>> + * framebuffer's kernel driver, and the callback is supplied with a list of > >>> + * pages that have been modified in the preceding interval. The callback can > >>> + * use this information to update the framebuffer hardware as necessary. The > >>> + * batching can improve performance and reduce the overhead of updating the > >>> + * hardware. > >>> + * > >>> + * Defio is supported on framebuffers allocated using vmalloc() and allocated > >>> + * as contiguous kernel memory using alloc_pages() or kmalloc(). These > >>> + * memory allocations all have corresponding "struct page"s. Framebuffers > >>> + * allocated using dma_alloc_coherent() should not be used with defio. > >>> + * Such allocations should be treated as a black box owned by the DMA > >>> + * layer, and should not be deconstructed into individual pages as defio > >>> + * does. Framebuffers in MMIO space are *not* supported because MMIO space > >>> + * does not have corresponding "struct page"s. > >>> + * > >>> + * For framebuffers allocated using vmalloc(), struct fb_info must have > >>> + * "screen_buffer" set to the vmalloc address of the framebuffer. For > >>> + * framebuffers allocated from contiguous kernel memory, FBINFO_KMEMFB must > >>> + * be set, and "fix.smem_start" must be set to the physical address of the > >>> + * frame buffer. In both cases, "fix.smem_len" must be set to the framebuffer > >>> + * size in bytes.
> >>> + */ > >>> + > >>> #include <linux/module.h> > >>> #include <linux/kernel.h> > >>> #include <linux/errno.h> > >>> #include <linux/string.h> > >>> #include <linux/mm.h> > >>> +#include <linux/pfn_t.h> > >>> #include <linux/vmalloc.h> > >>> #include <linux/delay.h> > >>> #include <linux/interrupt.h> > >>> @@ -37,7 +66,7 @@ static struct page *fb_deferred_io_get_page(struct fb_info *info, unsigned long > >>> else if (info->fix.smem_start) > >>> page = pfn_to_page((info->fix.smem_start + offs) >> PAGE_SHIFT); > >>> > >>> - if (page) > >>> + if (page && !(info->flags & FBINFO_KMEMFB)) > >>> get_page(page); > >>> > >>> return page; > >>> @@ -137,6 +166,15 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf) > >>> > >>> BUG_ON(!info->fbdefio->mapping); > >>> > >>> + if (info->flags & FBINFO_KMEMFB) > >>> + /* > >>> + * In this path, the VMA is marked VM_PFNMAP, so mm assumes > >>> + * there is no struct page associated with the page. The > >>> + * PFN must be directly inserted and the created PTE will be > >>> + * marked "special". > >>> + */ > >>> + return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(page)); > >>> + > >>> vmf->page = page; > >>> return 0; > >>> } > >>> @@ -163,13 +201,14 @@ EXPORT_SYMBOL_GPL(fb_deferred_io_fsync); > >>> > >>> /* > >>> * Adds a page to the dirty list. Call this from struct > >>> - * vm_operations_struct.page_mkwrite. > >>> + * vm_operations_struct.page_mkwrite or .pfn_mkwrite. 
> >>> */ > >>> -static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long offset, > >>> +static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, struct vm_fault *vmf, > >>> struct page *page) > >>> { > >>> struct fb_deferred_io *fbdefio = info->fbdefio; > >>> struct fb_deferred_io_pageref *pageref; > >>> + unsigned long offset = vmf->pgoff << PAGE_SHIFT; > >>> vm_fault_t ret; > >>> > >>> /* protect against the workqueue changing the page list */ > >>> @@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long > >>> } > >>> > >>> /* > >>> - * We want the page to remain locked from ->page_mkwrite until > >>> - * the PTE is marked dirty to avoid mapping_wrprotect_range() > >>> - * being called before the PTE is updated, which would leave > >>> - * the page ignored by defio. > >>> - * Do this by locking the page here and informing the caller > >>> - * about it with VM_FAULT_LOCKED. > >>> + * The PTE must be marked writable before the defio deferred work runs > >>> + * again and potentially marks the PTE write-protected. If the order > >>> + * should be switched, the PTE would become writable without defio > >>> + * tracking the page, leaving the page forever ignored by defio. > >>> + * > >>> + * For vmalloc() framebuffers, the associated struct page is locked > >>> + * before releasing the defio lock. mm will later mark the PTE writaable > >>> + * and release the struct page lock. The struct page lock prevents > >>> + * the page from being prematurely being marked write-protected. > >>> + * > >>> + * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page, > >>> + * so the PTE must be marked writable while the defio lock is held. 
> >>> */ > >>> - lock_page(pageref->page); > >>> + if (info->flags & FBINFO_KMEMFB) { > >>> + unsigned long pfn = page_to_pfn(pageref->page); > >>> + > >>> + ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, > >>> + __pfn_to_pfn_t(pfn, PFN_SPECIAL)); > >> > >> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a > >> horrible hack. > >> > >> In another thread, you mention that you use PFN_SPECIAL to bypass the > >> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set? > > > > The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like > > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's > > a wrong impression. > > VM_PFNMAP: nothing is refcounted except anon pages > > VM_MIXEDMAP: anything with a "struct page" (pfn_valid()) is refcounted > > pte_special() is a way for GUP-fast to distinguish these refcounted (can > GUP) from non-refcounted (camnnot GUP) pages mapped by PTEs without any > locks or the VMA being available. > > Setting pte_special() in VM_MIXEDMAP on ptes that have a "struct page" > (pfn_valid()) is likely very bogus. OK, good to know. > > > vm_mixed_ok() does a thorough job of validating the > > use of __vm_insert_mixed(), and since what I did was allowed, I thought > > perhaps it was OK. Your feedback has set me straight, and that's what I > > needed. :-) > > What exactly are you trying to achieve? :) > > If it's mapping a page with a "struct page" and *not* refcounting it, > then vmf_insert_pfn() is the current way to achieve that in a VM_PFNMAP > mapping. It will set pte_special() automatically for you. > Yes, that's what I'm using to initially create the special PTE in the .fault callback. > > > > But the whole approach is moot with Alistair Popple's patch set that > > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a > > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed > > it. 
If there's not one, I'll take a crack at adding it in the next version of my > > patch set. > > I assume you'd want vmf_insert_pfn_mkwrite(), correct? Probably > vmf_insert_pfn_prot() can be used by adding PAGE_WRITE to pgprot. (maybe > :) ) Ok, I'll look at that more closely. The sequence is that the special PTE gets created with vmf_insert_pfn(). Then when the page is first written to, the .pfn_mkwrite callback is invoked by mm. The question is the best way for that callback to mark the existing PTE as writable. Thanks, Michael
From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Monday, June 2, 2025 11:25 PM > > Hi > > Am 03.06.25 um 03:49 schrieb Michael Kelley: > [...] > >> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a > >> horrible hack. > >> > >> In another thread, you mention that you use PFN_SPECIAL to bypass the > >> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set? > > The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like > > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's > > a wrong impression. vm_mixed_ok() does a thorough job of validating the > > use of __vm_insert_mixed(), and since what I did was allowed, I thought > > perhaps it was OK. Your feedback has set me straight, and that's what I > > needed. :-) > > > > But the whole approach is moot with Alistair Popple's patch set that > > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a > > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed > > it. If there's not one, I'll take a crack at adding it in the next version of my > > patch set. > > What is the motivation behind this work? The driver or fbdev as a whole > does not have much of a future anyway. > > I'd like to suggest removing hyperv_fb entirely in favor of hypervdrm? > Yes, I think that's the longer term direction. A couple months ago I had an email conversation with Saurabh Sengar from the Microsoft Linux team where he raised this idea. I think the Microsoft folks will need to drive the deprecation process, as they need to coordinate with the distro vendors who publish images for running on local Hyper-V and in the Azure cloud. And my understanding is that the Linux kernel process would want the driver to be available but marked "deprecated" for a year or so before it actually goes away. I do have some concerns about the maturity of the hyperv_drm driver "around the edges". For example, somebody just recently submitted a patch to flush output on panic. 
I have less familiarity with hyperv_drm vs. hyperv_fb, so some of my concern is probably due to that. We might need to do a review of hyperv_drm and see if there's anything else to deal with before hyperv_fb goes away. This all got started when I was looking at a problem with hyperv_fb, and I found several other related problems, some of which also existed in hyperv_drm. You've seen several small'ish fixes from me and Saurabh as a result, and this issue with mmap()'ing /dev/fb0 is the last one of that set. This fix is definitely a bit bigger, but it's the right fix. On the flip side, if we really get on a path to deprecate hyperv_fb, there are hack fixes for the mmap problem that are smaller and contained to hyperv_fb. I would be OK with a hack fix in that case. Michael
Hi Am 03.06.25 um 19:50 schrieb Michael Kelley: > From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Monday, June 2, 2025 11:25 PM >> Hi >> >> Am 03.06.25 um 03:49 schrieb Michael Kelley: >> [...] >>>> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a >>>> horrible hack. >>>> >>>> In another thread, you mention that you use PFN_SPECIAL to bypass the >>>> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set? >>> The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like >>> VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's >>> a wrong impression. vm_mixed_ok() does a thorough job of validating the >>> use of __vm_insert_mixed(), and since what I did was allowed, I thought >>> perhaps it was OK. Your feedback has set me straight, and that's what I >>> needed. :-) >>> >>> But the whole approach is moot with Alistair Popple's patch set that >>> eliminates pfn_t. Is there an existing mm API that will do mkwrite on a >>> special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed >>> it. If there's not one, I'll take a crack at adding it in the next version of my >>> patch set. >> What is the motivation behind this work? The driver or fbdev as a whole >> does not have much of a future anyway. >> >> I'd like to suggest removing hyperv_fb entirely in favor of hypervdrm? >> > Yes, I think that's the longer term direction. A couple months ago I had an > email conversation with Saurabh Sengar from the Microsoft Linux team where > he raised this idea. I think the Microsoft folks will need to drive the deprecation > process, as they need to coordinate with the distro vendors who publish > images for running on local Hyper-V and in the Azure cloud. And my > understanding is that the Linux kernel process would want the driver to > be available but marked "deprecated" for a year or so before it actually > goes away. 
We (DRM upstream) recently considered moving some fbdev drivers to drivers/staging or marking them with !DRM if a DRM driver is available. Hyperv_fb would be a candidate. At least at SUSE, we ship hypervdrm instead of hyperv_fb. This works well on the various generations of the hyperv system. Much of our userspace would not be able to use hyperv_fb anyway. > > I do have some concerns about the maturity of the hyperv_drm driver > "around the edges". For example, somebody just recently submitted a > patch to flush output on panic. I have less familiarity hyperv_drm vs. > hyperv_fb, so some of my concern is probably due to that. We might > need to do review of hyperv_drm and see if there's anything else to > deal with before hyperv_fb goes away. The panic output is a feature that we recently added to the kernel. It allows a DRM driver to display a final error message in the case of a kernel panic (think of blue screens on Windows). Drivers require a minimum of support to make it work. That's what the hypervdrm patches were about. Best regards Thomas > > This all got started when I was looking at a problem with hyperv_fb, > and I found several other related problems, some of which also existed > in hyperv_drm. You've seen several small'ish fixes from me and Saurabh > as a result, and this issue with mmap()'ing /dev/fb0 is the last one of that > set. This fix is definitely a bit bigger, but it's the right fix. On the flip side, > if we really get on a path to deprecate hyperv_fb, there are hack fixes for > the mmap problem that are smaller and contained to hyperv_fb. I would > be OK with a hack fix in that case. > > Michael
On Wed, Jun 04, 2025 at 10:12:45AM +0200, Thomas Zimmermann wrote: > Hi > > Am 03.06.25 um 19:50 schrieb Michael Kelley: > > From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Monday, June 2, 2025 11:25 PM > > > Hi > > > > > > Am 03.06.25 um 03:49 schrieb Michael Kelley: > > > [...] > > > > > Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a > > > > > horrible hack. > > > > > > > > > > In another thread, you mention that you use PFN_SPECIAL to bypass the > > > > > check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set? > > > > The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like > > > > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's > > > > a wrong impression. vm_mixed_ok() does a thorough job of validating the > > > > use of __vm_insert_mixed(), and since what I did was allowed, I thought > > > > perhaps it was OK. Your feedback has set me straight, and that's what I > > > > needed. :-) > > > > > > > > But the whole approach is moot with Alistair Popple's patch set that > > > > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a > > > > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed > > > > it. If there's not one, I'll take a crack at adding it in the next version of my > > > > patch set. > > > What is the motivation behind this work? The driver or fbdev as a whole > > > does not have much of a future anyway. > > > > > > I'd like to suggest removing hyperv_fb entirely in favor of hypervdrm? > > > > > Yes, I think that's the longer term direction. A couple months ago I had an > > email conversation with Saurabh Sengar from the Microsoft Linux team where > > he raised this idea. I think the Microsoft folks will need to drive the deprecation > > process, as they need to coordinate with the distro vendors who publish > > images for running on local Hyper-V and in the Azure cloud. 
And my > > understanding is that the Linux kernel process would want the driver to > > be available but marked "deprecated" for a year or so before it actually > > goes away. > > We (DRM upstream) recently considered moving some fbdev drivers to > drivers/staging or marking them with !DRM if a DRM driver is available. > Hyverv_fb would be a candidate. > > At least at SUSE, we ship hypervdrm instead of hyperv_fb. This works well on > the various generations of the hyperv system. Much of our userspace would > not be able to use hyperv_fb anyway. Yeah investing into fbdev drivers, especially when some mm surgery seems needed, does not sound like a good idea to me overall. > > I do have some concerns about the maturity of the hyperv_drm driver > > "around the edges". For example, somebody just recently submitted a > > patch to flush output on panic. I have less familiarity hyperv_drm vs. > > hyperv_fb, so some of my concern is probably due to that. We might > > need to do review of hyperv_drm and see if there's anything else to > > deal with before hyperv_fb goes away. > > The panic output is a feature that we recently added to the kernel. It > allows a DRM driver to display a final error message in the case of a kernel > panic (think of blue screens on Windows). Drivers require a minimum of > support to make it work. That's what the hypervdrm patches were about. I'm also happy to help with any other issues and shortfalls of drm vs fbdev. There are some, but I thought it was mostly around some of the low bit color formats that really old devices want, and not anything that hyperv would need. Cheers, Sima > > Best regards > Thomas > > > > > This all got started when I was looking at a problem with hyperv_fb, > > and I found several other related problems, some of which also existed > > in hyperv_drm. You've seen several small'ish fixes from me and Saurabh > > as a result, and this issue with mmap()'ing /dev/fb0 is the last one of that > > set. 
This fix is definitely a bit bigger, but it's the right fix. On the flip side, > > if we really get on a path to deprecate hyperv_fb, there are hack fixes for > > the mmap problem that are smaller and contained to hyperv_fb. I would > > be OK with a hack fix in that case. > > > > Michael > > -- > -- > Thomas Zimmermann > Graphics Driver Developer > SUSE Software Solutions Germany GmbH > Frankenstrasse 146, 90461 Nuernberg, Germany > GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman > HRB 36809 (AG Nuernberg) >
From: Simona Vetter <simona.vetter@ffwll.ch> Sent: Wednesday, June 4, 2025 7:46 AM > > On Wed, Jun 04, 2025 at 10:12:45AM +0200, Thomas Zimmermann wrote: > > Hi > > > > Am 03.06.25 um 19:50 schrieb Michael Kelley: > > > From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Monday, June 2, 2025 11:25 PM > > > > Hi > > > > > > > > Am 03.06.25 um 03:49 schrieb Michael Kelley: > > > > [...] > > > > What is the motivation behind this work? The driver or fbdev as a whole > > > > does not have much of a future anyway. > > > > > > > > I'd like to suggest removing hyperv_fb entirely in favor of hypervdrm? > > > > > > > Yes, I think that's the longer term direction. A couple months ago I had an > > > email conversation with Saurabh Sengar from the Microsoft Linux team where > > > he raised this idea. I think the Microsoft folks will need to drive the deprecation > > > process, as they need to coordinate with the distro vendors who publish > > > images for running on local Hyper-V and in the Azure cloud. And my > > > understanding is that the Linux kernel process would want the driver to > > > be available but marked "deprecated" for a year or so before it actually > > > goes away. > > > > We (DRM upstream) recently considered moving some fbdev drivers to > > drivers/staging or marking them with !DRM if a DRM driver is available. > > Hyverv_fb would be a candidate. > > > > At least at SUSE, we ship hypervdrm instead of hyperv_fb. This works well on > > the various generations of the hyperv system. Much of our userspace would > > not be able to use hyperv_fb anyway. Good to know. Red Hat has made the switch as well. The Ubuntu images in Azure have both hyperv_fb and hyperv_drm. I don't know what other distros have done. > > Yeah investing into fbdev drivers, especially when some mm surgery seems > needed, does not sound like a good idea to me overall. > > > > I do have some concerns about the maturity of the hyperv_drm driver > > > "around the edges". 
For example, somebody just recently submitted a > > > patch to flush output on panic. I have less familiarity hyperv_drm vs. > > > hyperv_fb, so some of my concern is probably due to that. We might > > > need to do review of hyperv_drm and see if there's anything else to > > > deal with before hyperv_fb goes away. > > > > The panic output is a feature that we recently added to the kernel. It > > allows a DRM driver to display a final error message in the case of a kernel > > panic (think of blue screens on Windows). Drivers require a minimum of > > support to make it work. That's what the hypervdrm patches were about. > > I'm also happy to help with any other issues and shortfalls of drm vs > fbdev. There are some, but I thought it was mostly around some of the low > bit color formats that really old devices want, and not anything that > hyperv would need. You've set me up perfectly to raise an issue. :-) I'm still relatively new to the hyperv_drm driver and DRM in general, compared with hyperv_fb. One capability in fbdev is deferred I/O, which is what this entire patch series is about. The hyperv_drm driver doesn't currently use anything similar to deferred I/O like hyperv_fb. I don't know if that's because hyperv_drm doesn't make use of what DRM has to offer, or if DRM doesn't have a deferred I/O framework like fbdev. Do you know what the situation is? Or could you point me to an example of doing deferred I/O with DRM that hyperv_drm should be following? I ran a quick performance test comparing hyperv_drm with hyperv_fb. The test does "cat" of a big text file in the Hyper-V graphics console. The file has 1024 * 1024 lines, each with 64 characters, so total file size is 64 MiB. With hyperv_fb the test completes in 24 seconds elapsed time, with 24 seconds of system CPU time. With hyperv_drm, it takes 34 seconds elapsed time, but with about the same 24 seconds of system CPU time. 
Overall this difference isn't huge, and probably isn't that noticeable when doing human-scale work (i.e., 'dmesg' outputting several hundred lines in 0.19 seconds vs. my test doing 1M lines) on the Hyper-V graphics console. To me, the console doesn't feel slow with hyperv_drm compared to hyperv_fb, which is good. Nonetheless, there's an underlying issue. A main cause of the difference is the number of messages to Hyper-V to update dirty regions. With hyperv_fb using deferred I/O, the messages are limited to 20/second, so the total number of messages to Hyper-V is about 480. But hyperv_drm appears to send 3 messages to Hyper-V for each line of output, or a total of about 3,000,000 messages (~90K/second). That's a lot of additional load on the Hyper-V host, and it adds the 10 seconds of additional elapsed time seen in the guest. There's also this ugly output in dmesg because the ring buffer for sending messages to the Hyper-V host gets full -- Hyper-V doesn't always keep up, at least not on my local laptop where I'm testing: [12574.327615] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 [12574.327684] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 [12574.327760] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 [12574.327841] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 [12597.016128] hyperv_sendpacket: 6211 callbacks suppressed [12597.016133] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 [12597.016172] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 [12597.016220] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 [12597.016267] hyperv_drm
5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 hyperv_drm could be fixed to not output the ugly messages, but there's still the underlying issue of overrunning the ring buffer, and excessively hammering on the host. If we could get hyperv_drm doing deferred I/O, I would feel much better about going full-on with deprecating hyperv_fb. Michael
From: Michael Kelley <mhklinux@outlook.com> Sent: Tuesday, June 3, 2025 10:25 AM > > From: David Hildenbrand <david@redhat.com> Sent: Tuesday, June 3, 2025 12:55 AM > > > > On 03.06.25 03:49, Michael Kelley wrote: > > > From: David Hildenbrand <david@redhat.com> Sent: Monday, June 2, 2025 2:48 AM > > >> [snip] > > >>> @@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long > > >>> } > > >>> > > >>> /* > > >>> - * We want the page to remain locked from ->page_mkwrite until > > >>> - * the PTE is marked dirty to avoid mapping_wrprotect_range() > > >>> - * being called before the PTE is updated, which would leave > > >>> - * the page ignored by defio. > > >>> - * Do this by locking the page here and informing the caller > > >>> - * about it with VM_FAULT_LOCKED. > > >>> + * The PTE must be marked writable before the defio deferred work runs > > >>> + * again and potentially marks the PTE write-protected. If the order > > >>> + * should be switched, the PTE would become writable without defio > > >>> + * tracking the page, leaving the page forever ignored by defio. > > >>> + * > > >>> + * For vmalloc() framebuffers, the associated struct page is locked > > >>> + * before releasing the defio lock. mm will later mark the PTE writaable > > >>> + * and release the struct page lock. The struct page lock prevents > > >>> + * the page from being prematurely being marked write-protected. > > >>> + * > > >>> + * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page, > > >>> + * so the PTE must be marked writable while the defio lock is held. > > >>> */ > > >>> - lock_page(pageref->page); > > >>> + if (info->flags & FBINFO_KMEMFB) { > > >>> + unsigned long pfn = page_to_pfn(pageref->page); > > >>> + > > >>> + ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, > > >>> + __pfn_to_pfn_t(pfn, PFN_SPECIAL)); > > >> > > >> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? PFN_SPECIAL is a > > >> horrible hack. 
> > >> > > >> In another thread, you mention that you use PFN_SPECIAL to bypass the > > >> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set? > > > > > > The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like > > > VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's > > > a wrong impression. > > > > VM_PFNMAP: nothing is refcounted except anon pages > > > > VM_MIXEDMAP: anything with a "struct page" (pfn_valid()) is refcounted > > > > pte_special() is a way for GUP-fast to distinguish these refcounted (can > > GUP) from non-refcounted (camnnot GUP) pages mapped by PTEs without any > > locks or the VMA being available. > > > > Setting pte_special() in VM_MIXEDMAP on ptes that have a "struct page" > > (pfn_valid()) is likely very bogus. > > OK, good to know. > > > > > > vm_mixed_ok() does a thorough job of validating the > > > use of __vm_insert_mixed(), and since what I did was allowed, I thought > > > perhaps it was OK. Your feedback has set me straight, and that's what I > > > needed. :-) > > > > What exactly are you trying to achieve? :) > > > > If it's mapping a page with a "struct page" and *not* refcounting it, > > then vmf_insert_pfn() is the current way to achieve that in a VM_PFNMAP > > mapping. It will set pte_special() automatically for you. > > > > Yes, that's what I'm using to initially create the special PTE in the > .fault callback. > > > > > > > But the whole approach is moot with Alistair Popple's patch set that > > > eliminates pfn_t. Is there an existing mm API that will do mkwrite on a > > > special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed > > > it. If there's not one, I'll take a crack at adding it in the next version of my > > > patch set. > > > > I assume you'd want vmf_insert_pfn_mkwrite(), correct? Probably > > vmf_insert_pfn_prot() can be used by adding PAGE_WRITE to pgprot. (maybe > > :) ) > > Ok, I'll look at that more closely. 
The sequence is that the special > PTE gets created with vmf_insert_pfn(). Then when the page is first > written to, the .pfn_mkwrite callback is invoked by mm. The question > is the best way for that callback to mark the existing PTE as writable. > FWIW, vmf_insert_pfn_prot() won't work. It calls insert_pfn() with the "mkwrite" parameter set to 'false', in which case insert_pfn() does nothing if the PTE already exists. So I would need to create a new API that does appropriate validation for a VM_PFNMAP VMA, and then calls insert_pfn() with the "mkwrite" parameter set to 'true'. Michael
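[Editorial sketch] The fault sequence described above can be illustrated as a pair of vm_operations callbacks. This is a hedged sketch only, not buildable code: vmf_insert_pfn_mkwrite() is the API being proposed in this thread and does not exist at this point, and my_fb_page() stands in for the driver's offset-to-page lookup.

```c
/*
 * Sketch of the VM_PFNMAP fault flow under discussion.
 * vmf_insert_pfn_mkwrite() is the proposed (not yet existing) API;
 * my_fb_page() is a hypothetical helper.
 */
static vm_fault_t my_fault(struct vm_fault *vmf)
{
	struct fb_info *info = vmf->vma->vm_private_data;
	struct page *page = my_fb_page(info, vmf->pgoff << PAGE_SHIFT);

	/* First touch: insert a read-only "special" PTE, no refcounting. */
	return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(page));
}

static vm_fault_t my_pfn_mkwrite(struct vm_fault *vmf)
{
	struct fb_info *info = vmf->vma->vm_private_data;
	unsigned long pfn = page_to_pfn(my_fb_page(info, vmf->pgoff << PAGE_SHIFT));

	/*
	 * First write: upgrade the existing special PTE to writable so that
	 * defio can track the page until the deferred work wrprotects it
	 * again. The signature used here is an assumption.
	 */
	return vmf_insert_pfn_mkwrite(vmf->vma, vmf->address, pfn);
}
```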
Hi Am 04.06.25 um 23:43 schrieb Michael Kelley: > From: Simona Vetter <simona.vetter@ffwll.ch> Sent: Wednesday, June 4, 2025 7:46 AM >> On Wed, Jun 04, 2025 at 10:12:45AM +0200, Thomas Zimmermann wrote: >>> Hi >>> >>> Am 03.06.25 um 19:50 schrieb Michael Kelley: >>>> From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Monday, June 2, 2025 11:25 PM >>>>> Hi >>>>> >>>>> Am 03.06.25 um 03:49 schrieb Michael Kelley: >>>>> [...] >>>>> What is the motivation behind this work? The driver or fbdev as a whole >>>>> does not have much of a future anyway. >>>>> >>>>> I'd like to suggest removing hyperv_fb entirely in favor of hypervdrm? >>>>> >>>> Yes, I think that's the longer term direction. A couple months ago I had an >>>> email conversation with Saurabh Sengar from the Microsoft Linux team where >>>> he raised this idea. I think the Microsoft folks will need to drive the deprecation >>>> process, as they need to coordinate with the distro vendors who publish >>>> images for running on local Hyper-V and in the Azure cloud. And my >>>> understanding is that the Linux kernel process would want the driver to >>>> be available but marked "deprecated" for a year or so before it actually >>>> goes away. >>> We (DRM upstream) recently considered moving some fbdev drivers to >>> drivers/staging or marking them with !DRM if a DRM driver is available. >>> Hyverv_fb would be a candidate. >>> >>> At least at SUSE, we ship hypervdrm instead of hyperv_fb. This works well on >>> the various generations of the hyperv system. Much of our userspace would >>> not be able to use hyperv_fb anyway. > Good to know. Red Hat has made the switch as well. The Ubuntu images > in Azure have both hyperv_fb and hyperv_drm. I don't know what other > distros have done. > >> Yeah investing into fbdev drivers, especially when some mm surgery seems >> needed, does not sound like a good idea to me overall. >> >>>> I do have some concerns about the maturity of the hyperv_drm driver >>>> "around the edges". 
For example, somebody just recently submitted a >>>> patch to flush output on panic. I have less familiarity hyperv_drm vs. >>>> hyperv_fb, so some of my concern is probably due to that. We might >>>> need to do review of hyperv_drm and see if there's anything else to >>>> deal with before hyperv_fb goes away. >>> The panic output is a feature that we recently added to the kernel. It >>> allows a DRM driver to display a final error message in the case of a kernel >>> panic (think of blue screens on Windows). Drivers require a minimum of >>> support to make it work. That's what the hypervdrm patches were about. >> I'm also happy to help with any other issues and shortfalls of drm vs >> fbdev. There are some, but I thought it was mostly around some of the low >> bit color formats that really old devices want, and not anything that >> hyperv would need. > You've set me up perfectly to raise an issue. :-) I'm still relatively new > to the hyperv_drm driver and DRM in general, compared with hyperv_fb. > One capability in fbdev is deferred I/O, which is what this entire patch > series is about. The hyperv_drm driver doesn't currently use anything > similar to deferred I/O like hyperv_fb. I don't know if that's because > hyperv_drm doesn't make use of what DRM has to offer, or if DRM doesn't > have a deferred I/O framework like fbdev. Do you know what the situation > is? Or could you point me to an example of doing deferred I/O with DRM > that hyperv_drm should be following? Fbdev deferred I/O is a workaround for the fact that fbdev does not require a flush operation on its I/O buffers. Writing to an mmaped buffer is expected to go to hardware immediately. On devices where this is not the case, deferred I/O tracks written pages and writes them back to hardware at intervals. For DRM, there's the MODE_DIRTYFB ioctl [1] that all userspace has to call after writing to mmap'ed buffers. So regular DRM doesn't need deferred I/O as userspace triggers writeback explicitly. 
[1] https://elixir.bootlin.com/linux/v6.15/source/drivers/gpu/drm/drm_ioctl.c#L686

> 
> I ran a quick performance test comparing hyperv_drm with hyperv_fb.
> The test does "cat" of a big text file in the Hyper-V graphics console. The
> file has 1024 * 1024 lines, each with 64 characters, so total file size is
> 64 MiB.
> 
> With hyperv_fb the test completes in 24 seconds elapsed time, with
> 24 seconds of system CPU time. With hyperv_drm, it takes 34 seconds
> elapsed time, but with about the same 24 seconds of system CPU time.
> Overall this difference isn't huge, and probably isn't that noticeable
> when doing human-scale work (i.e., 'dmesg' outputting several
> hundred lines in 0.19 seconds vs. my test doing 1M lines) on the Hyper-V
> graphics console. To me, the console doesn't feel slow with hyperv_drm
> compared to hyperv_fb, which is good.

DRM consoles are technically fbdev devices that operate on a DRM device. DRM and fbdev have some differences that can make this problematic. I'm not surprised that there are issues.

> 
> Nonetheless, there's an underlying issue. A main cause of the difference
> is the number of messages to Hyper-V to update dirty regions. With
> hyperv_fb using deferred I/O, the messages are limited to 20/second, so
> the total number of messages to Hyper-V is about 480. But hyperv_drm
> appears to send 3 messages to Hyper-V for each line of output, or a total of
> about 3,000,000 messages (~90K/second). That's a lot of additional load
> on the Hyper-V host, and it adds the 10 seconds of additional elapsed
> time seen in the guest.
> There is also this ugly output in dmesg because the
> ring buffer for sending messages to the Hyper-V host gets full -- Hyper-V
> doesn't always keep up, at least not on my local laptop where I'm
> testing:
> 
> [12574.327615] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11
> [12574.327684] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11
> [12574.327760] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11
> [12574.327841] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11
> [12597.016128] hyperv_sendpacket: 6211 callbacks suppressed
> [12597.016133] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11
> [12597.016172] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11
> [12597.016220] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11
> [12597.016267] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11
> 
> hyperv_drm could be fixed to not output the ugly messages, but there's
> still the underlying issue of overrunning the ring buffer, and excessively
> hammering on the host. If we could get hyperv_drm doing deferred I/O, I
> would feel much better about going full-on with deprecating hyperv_fb.

Thanks for debugging this. A number of things are playing into this.

- DRM performs display output along vblank IRQs. For example, if the display runs with 60 Hz, there should be no more than 60 display updates per second. From what I can tell, there's no IRQ support in hypervdrm (or HyperV in general?). Without IRQ support, drivers output to hardware ASAP, which can result in large numbers of buffer updates per second.
I've heard about this problem in another context [2], and you're likely seeing a similar issue.

- DRM's console also needs better support for vblank interrupts. It currently sends out updates ASAP as well.

Both points are not much of a problem on most desktop and server systems, but can be an issue with virtualization.

[2] https://bugzilla.suse.com/show_bug.cgi?id=1189174

Best regards
Thomas

> 
> Michael
> 
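[Editorial sketch] To make the "limited to 20/second" figure for hyperv_fb concrete: an fbdev driver opts into defio batching roughly as below. This is a hedged sketch, not a definitive implementation; my_send_dirty_rect() is a hypothetical stand-in for the driver's host-update path, and the probe-path lines are shown out of context.

```c
/*
 * Sketch: fbdev deferred I/O invokes the driver callback at most once per
 * 'delay' jiffies, with the list of pages written since the last interval.
 * my_send_dirty_rect() is hypothetical.
 */
static void my_deferred_io(struct fb_info *info, struct list_head *pagereflist)
{
	struct fb_deferred_io_pageref *pageref;

	/* One batch per interval: flush each dirtied page to the host. */
	list_for_each_entry(pageref, pagereflist, list)
		my_send_dirty_rect(info, pageref->offset, PAGE_SIZE);
}

static struct fb_deferred_io my_defio = {
	.delay		= HZ / 20,	/* 50 ms batches: at most ~20 updates/s */
	.deferred_io	= my_deferred_io,
};

/* In the driver's probe path: */
	info->fbdefio = &my_defio;
	fb_deferred_io_init(info);
```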
On 04.06.25 23:58, Michael Kelley wrote: > From: Michael Kelley <mhklinux@outlook.com> Sent: Tuesday, June 3, 2025 10:25 AM >> >> From: David Hildenbrand <david@redhat.com> Sent: Tuesday, June 3, 2025 12:55 AM >>> >>> On 03.06.25 03:49, Michael Kelley wrote: >>>> From: David Hildenbrand <david@redhat.com> Sent: Monday, June 2, 2025 2:48 AM >>>>> > > [snip] > >>>>>> @@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long >>>>>> } >>>>>> >>>>>> /* >>>>>> - * We want the page to remain locked from ->page_mkwrite until >>>>>> - * the PTE is marked dirty to avoid mapping_wrprotect_range() >>>>>> - * being called before the PTE is updated, which would leave >>>>>> - * the page ignored by defio. >>>>>> - * Do this by locking the page here and informing the caller >>>>>> - * about it with VM_FAULT_LOCKED. >>>>>> + * The PTE must be marked writable before the defio deferred work runs >>>>>> + * again and potentially marks the PTE write-protected. If the order >>>>>> + * should be switched, the PTE would become writable without defio >>>>>> + * tracking the page, leaving the page forever ignored by defio. >>>>>> + * >>>>>> + * For vmalloc() framebuffers, the associated struct page is locked >>>>>> + * before releasing the defio lock. mm will later mark the PTE writaable >>>>>> + * and release the struct page lock. The struct page lock prevents >>>>>> + * the page from being prematurely being marked write-protected. >>>>>> + * >>>>>> + * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page, >>>>>> + * so the PTE must be marked writable while the defio lock is held. >>>>>> */ >>>>>> - lock_page(pageref->page); >>>>>> + if (info->flags & FBINFO_KMEMFB) { >>>>>> + unsigned long pfn = page_to_pfn(pageref->page); >>>>>> + >>>>>> + ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, >>>>>> + __pfn_to_pfn_t(pfn, PFN_SPECIAL)); >>>>> >>>>> Will the VMA have VM_PFNMAP or VM_MIXEDMAP set? 
PFN_SPECIAL is a >>>>> horrible hack. >>>>> >>>>> In another thread, you mention that you use PFN_SPECIAL to bypass the >>>>> check in vm_mixed_ok(), so VM_MIXEDMAP is likely not set? >>>> >>>> The VMA has VM_PFNMAP set, not VM_MIXEDMAP. It seemed like >>>> VM_MIXEDMAP is somewhat of a superset of VM_PFNMAP, but maybe that's >>>> a wrong impression. >>> >>> VM_PFNMAP: nothing is refcounted except anon pages >>> >>> VM_MIXEDMAP: anything with a "struct page" (pfn_valid()) is refcounted >>> >>> pte_special() is a way for GUP-fast to distinguish these refcounted (can >>> GUP) from non-refcounted (camnnot GUP) pages mapped by PTEs without any >>> locks or the VMA being available. >>> >>> Setting pte_special() in VM_MIXEDMAP on ptes that have a "struct page" >>> (pfn_valid()) is likely very bogus. >> >> OK, good to know. >> >>> >>>> vm_mixed_ok() does a thorough job of validating the >>>> use of __vm_insert_mixed(), and since what I did was allowed, I thought >>>> perhaps it was OK. Your feedback has set me straight, and that's what I >>>> needed. :-) >>> >>> What exactly are you trying to achieve? :) >>> >>> If it's mapping a page with a "struct page" and *not* refcounting it, >>> then vmf_insert_pfn() is the current way to achieve that in a VM_PFNMAP >>> mapping. It will set pte_special() automatically for you. >>> >> >> Yes, that's what I'm using to initially create the special PTE in the >> .fault callback. >> >>>> >>>> But the whole approach is moot with Alistair Popple's patch set that >>>> eliminates pfn_t. Is there an existing mm API that will do mkwrite on a >>>> special PTE in a VM_PFNMAP VMA? I didn't see one, but maybe I missed >>>> it. If there's not one, I'll take a crack at adding it in the next version of my >>>> patch set. >>> >>> I assume you'd want vmf_insert_pfn_mkwrite(), correct? Probably >>> vmf_insert_pfn_prot() can be used by adding PAGE_WRITE to pgprot. (maybe >>> :) ) >> >> Ok, I'll look at that more closely. 
The sequence is that the special >> PTE gets created with vmf_insert_pfn(). Then when the page is first >> written to, the .pfn_mkwrite callback is invoked by mm. The question >> is the best way for that callback to mark the existing PTE as writable. >> > > FWIW, vmf_insert_pfn_prot() won't work. It calls insert_pfn() with > the "mkwrite" parameter set to 'false', in which case insert_pfn() > does nothing if the PTE already exists. Ah, you are worried about the "already exists but is R/O case". > > So I would need to create a new API that does appropriate validation > for a VM_PFNMAP VMA, and then calls insert_pfn() with the "mkwrite" > parameter set to 'true'. IMHO, nothing would speak against vmf_insert_pfn_mkwrite(). Much better than using that "mixed" ... beauty of a function.
Hi Am 04.06.25 um 23:43 schrieb Michael Kelley: [...] > Nonetheless, there's an underlying issue. A main cause of the difference > is the number of messages to Hyper-V to update dirty regions. With > hyperv_fb using deferred I/O, the messages are limited 20/second, so > the total number of messages to Hyper-V is about 480. But hyperv_drm > appears to send 3 messages to Hyper-V for each line of output, or a total of > about 3,000,000 messages (~90K/second). That's a lot of additional load > on the Hyper-V host, and it adds the 10 seconds of additional elapsed > time seen in the guest. There also this ugly output in dmesg because the > ring buffer for sending messages to the Hyper-V host gets full -- Hyper-V > doesn't always keep up, at least not on my local laptop where I'm > testing: > > [12574.327615] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 > [12574.327684] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 > [12574.327760] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 > [12574.327841] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 > [12597.016128] hyperv_sendpacket: 6211 callbacks suppressed > [12597.016133] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 > [12597.016172] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 > [12597.016220] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 > [12597.016267] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] *ERROR* Unable to send packet via vmbus; error -11 > > hyperv_drm could be fixed to not output the ugly messages, but there's > still the underlying issue of overrunning the ring buffer, and excessively > 
hammering on the host. If we could get hyperv_drm doing deferred I/O, I > would feel much better about going full-on with deprecating hyperv_fb. I try to address the problem with the patches at https://lore.kernel.org/dri-devel/20250605152637.98493-1-tzimmermann@suse.de/ Testing and feedback is much appreciated. Best regards Thomas > > Michael >
From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Thursday, June 5, 2025 8:36 AM > > Hi > > Am 04.06.25 um 23:43 schrieb Michael Kelley: > [...] > > Nonetheless, there's an underlying issue. A main cause of the difference > > is the number of messages to Hyper-V to update dirty regions. With > > hyperv_fb using deferred I/O, the messages are limited 20/second, so > > the total number of messages to Hyper-V is about 480. But hyperv_drm > > appears to send 3 messages to Hyper-V for each line of output, or a total of > > about 3,000,000 messages (~90K/second). That's a lot of additional load > > on the Hyper-V host, and it adds the 10 seconds of additional elapsed > > time seen in the guest. There also this ugly output in dmesg because the > > ring buffer for sending messages to the Hyper-V host gets full -- Hyper-V > > doesn't always keep up, at least not on my local laptop where I'm > > testing: > > > > [12574.327615] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > *ERROR* Unable to send packet via vmbus; error -11 > > [12574.327684] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > *ERROR* Unable to send packet via vmbus; error -11 > > [12574.327760] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > *ERROR* Unable to send packet via vmbus; error -11 > > [12574.327841] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > *ERROR* Unable to send packet via vmbus; error -11 > > [12597.016128] hyperv_sendpacket: 6211 callbacks suppressed > > [12597.016133] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > *ERROR* Unable to send packet via vmbus; error -11 > > [12597.016172] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > *ERROR* Unable to send packet via vmbus; error -11 > > [12597.016220] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > *ERROR* Unable to send packet via vmbus; error -11 > > [12597.016267] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > *ERROR* Unable to send packet via vmbus; error -11 > > > 
> hyperv_drm could be fixed to not output the ugly messages, but there's > > still the underlying issue of overrunning the ring buffer, and excessively > > hammering on the host. If we could get hyperv_drm doing deferred I/O, I > > would feel much better about going full-on with deprecating hyperv_fb. > > I try to address the problem with the patches at > > https://lore.kernel.org/dri-devel/20250605152637.98493-1-tzimmermann@suse.de/ > > Testing and feedback is much appreciated. > Nice! I ran the same test case with your patches, and everything works well. The hyperv_drm numbers are now pretty much the same as the hyperv_fb numbers for both elapsed time and system CPU time -- within a few percent. For hyperv_drm, there's no longer a gap in the elapsed time and system CPU time. No errors due to the guest-to-host ring buffer being full. Total messages to Hyper-V for hyperv_drm are now a few hundred instead of 3M. The hyperv_drm message count is still a little higher than for hyperv_fb, presumably because the simulated vblank rate in hyperv_drm is higher than the 20 Hz rate used by hyperv_fb deferred I/O. But the overall numbers are small enough that the difference is in the noise. Question: what is the default value for the simulated vblank rate? Just curious ... Michael
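[Editorial note] For reference, a 64 MiB test file matching the one described earlier in the thread (1024 * 1024 lines of 64 bytes) can be generated as follows; cat'ing it on the active graphics console, rather than to /dev/null, reproduces the actual measurement:

```shell
# Build 1024*1024 lines of 64 bytes each (63 'x' characters plus newline),
# for exactly 64 MiB, then time pushing it through 'cat'.
yes "$(printf 'x%.0s' $(seq 63))" | head -n $((1024 * 1024)) > /tmp/defio-test.txt
wc -c /tmp/defio-test.txt   # 67108864 bytes = 64 MiB
time cat /tmp/defio-test.txt > /dev/null
```

Run the 'cat' on the console under test (e.g., the Hyper-V graphics console) to exercise the dirty-region update path.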
Hi

Am 05.06.25 um 19:38 schrieb Michael Kelley:
[...]
>> I try to address the problem with the patches at
>>
>> https://lore.kernel.org/dri-devel/20250605152637.98493-1-tzimmermann@suse.de/
>>
>> Testing and feedback is much appreciated.
>>
> Nice!
> 
> I ran the same test case with your patches, and everything works well. The
> hyperv_drm numbers are now pretty much the same as the hyperv_fb
> numbers for both elapsed time and system CPU time -- within a few percent.
> For hyperv_drm, there's no longer a gap in the elapsed time and system
> CPU time. No errors due to the guest-to-host ring buffer being full. Total
> messages to Hyper-V for hyperv_drm are now a few hundred instead of 3M.

Sounds great. Credit also goes to the vkms devs, who already have the software vblank in their driver. This might need better support for cases where display updates take exceptionally long, but I can see this being merged as a DRM feature.

> The hyperv_drm message count is still a little higher than for hyperv_fb,
> presumably because the simulated vblank rate in hyperv_drm is higher than
> the 20 Hz rate used by hyperv_fb deferred I/O. But the overall numbers are
> small enough that the difference is in the noise. Question: what is the default
> value for the simulated vblank rate? Just curious ...

As with a hardware interrupt, the vblank rate comes from the programmed display mode, so most likely 60 Hz. The difference in update frequency could explain the remaining differences relative to hyperv_fb.

Best regards
Thomas

> 
> Michael
> 
From: Michael Kelley Sent: Thursday, June 5, 2025 10:39 AM > > From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Thursday, June 5, 2025 > 8:36 AM > > > > Hi > > > > Am 04.06.25 um 23:43 schrieb Michael Kelley: > > [...] > > > Nonetheless, there's an underlying issue. A main cause of the difference > > > is the number of messages to Hyper-V to update dirty regions. With > > > hyperv_fb using deferred I/O, the messages are limited 20/second, so > > > the total number of messages to Hyper-V is about 480. But hyperv_drm > > > appears to send 3 messages to Hyper-V for each line of output, or a total of > > > about 3,000,000 messages (~90K/second). That's a lot of additional load > > > on the Hyper-V host, and it adds the 10 seconds of additional elapsed > > > time seen in the guest. There also this ugly output in dmesg because the > > > ring buffer for sending messages to the Hyper-V host gets full -- Hyper-V > > > doesn't always keep up, at least not on my local laptop where I'm > > > testing: > > > > > > [12574.327615] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > > *ERROR* Unable to send packet via vmbus; error -11 > > > [12574.327684] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > > *ERROR* Unable to send packet via vmbus; error -11 > > > [12574.327760] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > > *ERROR* Unable to send packet via vmbus; error -11 > > > [12574.327841] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > > *ERROR* Unable to send packet via vmbus; error -11 > > > [12597.016128] hyperv_sendpacket: 6211 callbacks suppressed > > > [12597.016133] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > > *ERROR* Unable to send packet via vmbus; error -11 > > > [12597.016172] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > > *ERROR* Unable to send packet via vmbus; error -11 > > > [12597.016220] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > > *ERROR* Unable to send packet via vmbus; error -11 > > 
> [12597.016267] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] > > *ERROR* Unable to send packet via vmbus; error -11 > > > > > > hyperv_drm could be fixed to not output the ugly messages, but there's > > > still the underlying issue of overrunning the ring buffer, and excessively > > > hammering on the host. If we could get hyperv_drm doing deferred I/O, I > > > would feel much better about going full-on with deprecating hyperv_fb. > > > > I try to address the problem with the patches at > > > > https://lore.kernel.org/dri-devel/20250605152637.98493-1- > tzimmermann@suse.de/ > > > > Testing and feedback is much appreciated. > > > > Nice! > > I ran the same test case with your patches, and everything works well. The > hyperv_drm numbers are now pretty much the same as the hyperv_fb > numbers for both elapsed time and system CPU time -- within a few percent. > For hyperv_drm, there's no longer a gap in the elapsed time and system > CPU time. No errors due to the guest-to-host ring buffer being full. Total > messages to Hyper-V for hyperv_drm are now a few hundred instead of 3M. > The hyperv_drm message count is still a little higher than for hyperv_fb, > presumably because the simulated vblank rate in hyperv_drm is higher than > the 20 Hz rate used by hyperv_fb deferred I/O. But the overall numbers are > small enough that the difference is in the noise. Question: what is the default > value for the simulated vblank rate? Just curious ... > FYI, I'm seeing this message occasionally when running with your simulated vblank code and hyperv_drm: [90920.128278] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] vblank timer overrun "Occasionally" is about a dozen occurrences over the last day or so. I can't yet correlate to any particular activity in the VM. The graphics console has not been very busy. Michael
Hi Am 12.06.25 um 01:18 schrieb Michael Kelley: > From: Michael Kelley Sent: Thursday, June 5, 2025 10:39 AM >> From: Thomas Zimmermann <tzimmermann@suse.de> Sent: Thursday, June 5, 2025 >> 8:36 AM >>> Hi >>> >>> Am 04.06.25 um 23:43 schrieb Michael Kelley: >>> [...] >>>> Nonetheless, there's an underlying issue. A main cause of the difference >>>> is the number of messages to Hyper-V to update dirty regions. With >>>> hyperv_fb using deferred I/O, the messages are limited 20/second, so >>>> the total number of messages to Hyper-V is about 480. But hyperv_drm >>>> appears to send 3 messages to Hyper-V for each line of output, or a total of >>>> about 3,000,000 messages (~90K/second). That's a lot of additional load >>>> on the Hyper-V host, and it adds the 10 seconds of additional elapsed >>>> time seen in the guest. There also this ugly output in dmesg because the >>>> ring buffer for sending messages to the Hyper-V host gets full -- Hyper-V >>>> doesn't always keep up, at least not on my local laptop where I'm >>>> testing: >>>> >>>> [12574.327615] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] >>> *ERROR* Unable to send packet via vmbus; error -11 >>>> [12574.327684] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] >>> *ERROR* Unable to send packet via vmbus; error -11 >>>> [12574.327760] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] >>> *ERROR* Unable to send packet via vmbus; error -11 >>>> [12574.327841] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] >>> *ERROR* Unable to send packet via vmbus; error -11 >>>> [12597.016128] hyperv_sendpacket: 6211 callbacks suppressed >>>> [12597.016133] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] >>> *ERROR* Unable to send packet via vmbus; error -11 >>>> [12597.016172] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] >>> *ERROR* Unable to send packet via vmbus; error -11 >>>> [12597.016220] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] >>> *ERROR* Unable to send 
packet via vmbus; error -11 >>>> [12597.016267] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] >>> *ERROR* Unable to send packet via vmbus; error -11 >>>> hyperv_drm could be fixed to not output the ugly messages, but there's >>>> still the underlying issue of overrunning the ring buffer, and excessively >>>> hammering on the host. If we could get hyperv_drm doing deferred I/O, I >>>> would feel much better about going full-on with deprecating hyperv_fb. >>> I try to address the problem with the patches at >>> >>> https://lore.kernel.org/dri-devel/20250605152637.98493-1- >> tzimmermann@suse.de/ >>> Testing and feedback is much appreciated. >>> >> Nice! >> >> I ran the same test case with your patches, and everything works well. The >> hyperv_drm numbers are now pretty much the same as the hyperv_fb >> numbers for both elapsed time and system CPU time -- within a few percent. >> For hyperv_drm, there's no longer a gap in the elapsed time and system >> CPU time. No errors due to the guest-to-host ring buffer being full. Total >> messages to Hyper-V for hyperv_drm are now a few hundred instead of 3M. >> The hyperv_drm message count is still a little higher than for hyperv_fb, >> presumably because the simulated vblank rate in hyperv_drm is higher than >> the 20 Hz rate used by hyperv_fb deferred I/O. But the overall numbers are >> small enough that the difference is in the noise. Question: what is the default >> value for the simulated vblank rate? Just curious ... >> > FYI, I'm seeing this message occasionally when running with your simulated > vblank code and hyperv_drm: > > [90920.128278] hyperv_drm 5620e0c7-8062-4dce-aeb7-520c7ef76171: [drm] vblank timer overrun > > "Occasionally" is about a dozen occurrences over the last day or so. I can't > yet correlate to any particular activity in the VM. The graphics console has > not been very busy. Thanks for the report. It can happen that the vblank timer is handled late. 
It's not strictly an error as the whole thing is best-effort anyway. Maybe the timeout could be adjusted. Best regards Thomas > > Michael >
diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
index 4fc93f253e06..f8ae91a1c4df 100644
--- a/drivers/video/fbdev/core/fb_defio.c
+++ b/drivers/video/fbdev/core/fb_defio.c
@@ -8,11 +8,40 @@
  * for more details.
  */
 
+/*
+ * Deferred I/O ("defio") allows framebuffers that are mmap()'ed to user space
+ * to batch user space writes into periodic updates to the underlying
+ * framebuffer hardware or other implementation (such as with a virtualized
+ * framebuffer in a VM). At each batch interval, a callback is invoked in the
+ * framebuffer's kernel driver, and the callback is supplied with a list of
+ * pages that have been modified in the preceding interval. The callback can
+ * use this information to update the framebuffer hardware as necessary. The
+ * batching can improve performance and reduce the overhead of updating the
+ * hardware.
+ *
+ * Defio is supported on framebuffers allocated using vmalloc() and allocated
+ * as contiguous kernel memory using alloc_pages() or kmalloc(). These
+ * memory allocations all have corresponding "struct page"s. Framebuffers
+ * allocated using dma_alloc_coherent() should not be used with defio.
+ * Such allocations should be treated as a black box owned by the DMA
+ * layer, and should not be deconstructed into individual pages as defio
+ * does. Framebuffers in MMIO space are *not* supported because MMIO space
+ * does not have corresponding "struct page"s.
+ *
+ * For framebuffers allocated using vmalloc(), struct fb_info must have
+ * "screen_buffer" set to the vmalloc address of the framebuffer. For
+ * framebuffers allocated from contiguous kernel memory, FBINFO_KMEMFB must
+ * be set, and "fix.smem_start" must be set to the physical address of the
+ * frame buffer. In both cases, "fix.smem_len" must be set to the framebuffer
+ * size in bytes.
+ */
+
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/string.h>
 #include <linux/mm.h>
+#include <linux/pfn_t.h>
 #include <linux/vmalloc.h>
 #include <linux/delay.h>
 #include <linux/interrupt.h>
@@ -37,7 +66,7 @@ static struct page *fb_deferred_io_get_page(struct fb_info *info, unsigned long
 	else if (info->fix.smem_start)
 		page = pfn_to_page((info->fix.smem_start + offs) >> PAGE_SHIFT);
 
-	if (page)
+	if (page && !(info->flags & FBINFO_KMEMFB))
 		get_page(page);
 
 	return page;
@@ -137,6 +166,15 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf)
 
 	BUG_ON(!info->fbdefio->mapping);
 
+	if (info->flags & FBINFO_KMEMFB)
+		/*
+		 * In this path, the VMA is marked VM_PFNMAP, so mm assumes
+		 * there is no struct page associated with the page. The
+		 * PFN must be directly inserted and the created PTE will be
+		 * marked "special".
+		 */
+		return vmf_insert_pfn(vmf->vma, vmf->address, page_to_pfn(page));
+
 	vmf->page = page;
 	return 0;
 }
@@ -163,13 +201,14 @@ EXPORT_SYMBOL_GPL(fb_deferred_io_fsync);
 
 /*
  * Adds a page to the dirty list. Call this from struct
- * vm_operations_struct.page_mkwrite.
+ * vm_operations_struct.page_mkwrite or .pfn_mkwrite.
  */
-static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long offset,
+static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, struct vm_fault *vmf,
 					    struct page *page)
 {
 	struct fb_deferred_io *fbdefio = info->fbdefio;
 	struct fb_deferred_io_pageref *pageref;
+	unsigned long offset = vmf->pgoff << PAGE_SHIFT;
 	vm_fault_t ret;
 
 	/* protect against the workqueue changing the page list */
@@ -182,20 +221,34 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
 	}
 
 	/*
-	 * We want the page to remain locked from ->page_mkwrite until
-	 * the PTE is marked dirty to avoid mapping_wrprotect_range()
-	 * being called before the PTE is updated, which would leave
-	 * the page ignored by defio.
-	 * Do this by locking the page here and informing the caller
-	 * about it with VM_FAULT_LOCKED.
+	 * The PTE must be marked writable before the defio deferred work runs
+	 * again and potentially marks the PTE write-protected. If the order
+	 * should be switched, the PTE would become writable without defio
+	 * tracking the page, leaving the page forever ignored by defio.
+	 *
+	 * For vmalloc() framebuffers, the associated struct page is locked
+	 * before releasing the defio lock. mm will later mark the PTE writable
+	 * and release the struct page lock. The struct page lock prevents
+	 * the page from being prematurely marked write-protected.
+	 *
+	 * For FBINFO_KMEMFB framebuffers, mm assumes there is no struct page,
+	 * so the PTE must be marked writable while the defio lock is held.
 	 */
-	lock_page(pageref->page);
+	if (info->flags & FBINFO_KMEMFB) {
+		unsigned long pfn = page_to_pfn(pageref->page);
+
+		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address,
+					       __pfn_to_pfn_t(pfn, PFN_SPECIAL));
+	} else {
+		lock_page(pageref->page);
+		ret = VM_FAULT_LOCKED;
+	}
 	mutex_unlock(&fbdefio->lock);
 
 	/* come back after delay to process the deferred IO */
 	schedule_delayed_work(&info->deferred_work, fbdefio->delay);
 
-	return VM_FAULT_LOCKED;
+	return ret;
 
 err_mutex_unlock:
 	mutex_unlock(&fbdefio->lock);
@@ -207,10 +260,10 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
  * @fb_info: The fbdev info structure
  * @vmf: The VM fault
  *
- * This is a callback we get when userspace first tries to
- * write to the page. We schedule a workqueue. That workqueue
- * will eventually mkclean the touched pages and execute the
- * deferred framebuffer IO. Then if userspace touches a page
+ * This is a callback we get when userspace first tries to write to a
+ * page. We schedule a workqueue. That workqueue will eventually do
+ * mapping_wrprotect_range() on the written pages and execute the
+ * deferred framebuffer IO. Then if userspace writes to a page
  * again, we repeat the same scheme.
  *
  * Returns:
@@ -218,12 +271,11 @@ static vm_fault_t fb_deferred_io_track_page(struct fb_info *info, unsigned long
  */
 static vm_fault_t fb_deferred_io_page_mkwrite(struct fb_info *info, struct vm_fault *vmf)
 {
-	unsigned long offset = vmf->pgoff << PAGE_SHIFT;
 	struct page *page = vmf->page;
 
 	file_update_time(vmf->vma->vm_file);
 
-	return fb_deferred_io_track_page(info, offset, page);
+	return fb_deferred_io_track_page(info, vmf, page);
 }
 
 /* vm_ops->page_mkwrite handler */
@@ -234,9 +286,25 @@ static vm_fault_t fb_deferred_io_mkwrite(struct vm_fault *vmf)
 	return fb_deferred_io_page_mkwrite(info, vmf);
 }
 
+/*
+ * Similar to fb_deferred_io_mkwrite(), but for first writes to pages
+ * in VMAs that have VM_PFNMAP set.
+ */
+static vm_fault_t fb_deferred_io_pfn_mkwrite(struct vm_fault *vmf)
+{
+	struct fb_info *info = vmf->vma->vm_private_data;
+	unsigned long offset = vmf->pgoff << PAGE_SHIFT;
+	struct page *page = phys_to_page(info->fix.smem_start + offset);
+
+	file_update_time(vmf->vma->vm_file);
+
+	return fb_deferred_io_track_page(info, vmf, page);
+}
+
 static const struct vm_operations_struct fb_deferred_io_vm_ops = {
 	.fault		= fb_deferred_io_fault,
 	.page_mkwrite	= fb_deferred_io_mkwrite,
+	.pfn_mkwrite	= fb_deferred_io_pfn_mkwrite,
 };
 
 static const struct address_space_operations fb_deferred_io_aops = {
@@ -246,11 +314,31 @@
 int fb_deferred_io_mmap(struct fb_info *info, struct vm_area_struct *vma)
 {
 	vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
+	vm_flags_t flags = VM_DONTEXPAND | VM_DONTDUMP;
 
 	vma->vm_ops = &fb_deferred_io_vm_ops;
-	vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
-	if (!(info->flags & FBINFO_VIRTFB))
-		vm_flags_set(vma, VM_IO);
+	if (info->flags & FBINFO_KMEMFB) {
+		/*
+		 * I/O fault path calls vmf_insert_pfn(), which bug checks
+		 * if the vma is not marked shared.
mmap'ing the framebuffer + * as PRIVATE doesn't really make sense anyway, though doing + * so isn't harmful for vmalloc() framebuffers. So there's + * no prohibition for that case. + */ + if (!(vma->vm_flags & VM_SHARED)) + return -EINVAL; + /* + * Set VM_PFNMAP so mm code will not try to manage the pages' + * lifecycles. We don't want individual pages to be freed + * based on refcount. Instead the memory must be returned to + * the free pool in the usual way. Cf. the implementation of + * remap_pfn_range() and remap_pfn_range_internal(). + */ + flags |= VM_PFNMAP | VM_IO; + } else if (!(info->flags & FBINFO_VIRTFB)) { + flags |= VM_IO; + } + vm_flags_set(vma, flags); vma->vm_private_data = info; return 0; }