diff mbox series

[v3,08/15] iommufd: Algorithms for PFN storage

Message ID 8-v3-402a7d6459de+24b-iommufd_jgg@nvidia.com
State Superseded
Headers show
Series IOMMUFD Generic interface | expand

Commit Message

Jason Gunthorpe Oct. 25, 2022, 6:12 p.m. UTC
The iopt_pages which represents a logical linear list of full PFNs held in
different storage tiers. Each area points to a slice of exactly one
iopt_pages, and each iopt_pages can have multiple areas and accesses.

The three storage tiers are managed to meet these objectives:

 - If no iommu_domain or in-kerenel access exists then minimal memory
   should be consumed by iomufd
 - If a page has been pinned then an iopt_pages will not pin it again
 - If an in-kernel access exists then the xarray must provide the backing
   storage to avoid allocations on domain removals
 - Otherwise any iommu_domain will be used for storage

In a common configuration with only an iommu_domain the iopt_pages does
not allocate significant memory itself.

The external interface for pages has several logical operations:

  iopt_area_fill_domain() will load the PFNs from storage into a single
  domain. This is used when attaching a new domain to an existing IOAS.

  iopt_area_fill_domains() will load the PFNs from storage into multiple
  domains. This is used when creating a new IOVA map in an existing IOAS

  iopt_pages_add_access() creates an iopt_pages_access that tracks an
  in-kernel access of PFNs. This is some external driver that might be
  accessing the IOVA using the CPU, or programming PFNs with the DMA
  API. ie a VFIO mdev.

  iopt_pages_rw_access() directly perform a memcpy on the PFNs, without
  the overhead of iopt_pages_add_access()

  iopt_pages_fill_xarray() will load PFNs into the xarray and return a
  'struct page *' array. It is used by iopt_pages_access's to extract PFNs
  for in-kernel use. iopt_pages_fill_from_xarray() is a fast path when it
  is known the xarray is already filled.

As an iopt_pages can be referred to in slices by many areas and accesses
it uses interval trees to keep track of which storage tiers currently hold
the PFNs. On a page-by-page basis any request for a PFN will be satisfied
from one of the storage tiers and the PFN copied to target domain/array.

Unfill actions are similar, on a page by page basis domains are unmapped,
xarray entries freed or struct pages fully put back.

Significant complexity is required to fully optimize all of these data
motions. The implementation calculates the largest consecutive range of
same-storage indexes and operates in blocks. The accumulation of PFNs
always generates the largest contiguous PFN range possible to optimize and
this gathering can cross storage tier boundaries. For cases like 'fill
domains' care is taken to avoid duplicated work and PFNs are read once and
pushed into all domains.

The map/unmap interaction with the iommu_domain always works in contiguous
PFN blocks. The implementation does not require or benefit from any
split/merge optimization in the iommu_domain driver.

This design suggests several possible improvements in the IOMMU API that
would greatly help performance, particularly a way for the driver to map
and read the pfns lists instead of working with one driver call per page
to read, and one driver call per contiguous range to store.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/iommufd/io_pagetable.h |  73 +++
 drivers/iommu/iommufd/pages.c        | 803 ++++++++++++++++++++++++++-
 2 files changed, 872 insertions(+), 4 deletions(-)

Comments

Jason Gunthorpe Oct. 31, 2022, 4:01 p.m. UTC | #1
On Tue, Oct 25, 2022 at 03:12:17PM -0300, Jason Gunthorpe wrote:

> +int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start_index,
> +			   unsigned long last_index, struct page **out_pages)
> +{
> +	struct interval_tree_double_span_iter span;
> +	unsigned long xa_end = start_index;
> +	struct pfn_reader_user user;
> +	int rc;
> +
> +	pfn_reader_user_init(&user, pages);
> +	user.upages_len = last_index - start_index + 1;

Syzkaller found that upages_len is supposed to be in bytes, not number
of pages. Which is surprising the test suite didn't find it.

-       user.upages_len = last_index - start_index + 1;
+       user.upages_len = (last_index - start_index + 1) * sizeof(*out_pages);

While fixing the test suite to cover this I also discovered this:

@@ -129,7 +129,7 @@ void interval_tree_span_iter_advance(struct interval_tree_span_iter *iter,
                return;
 
        iter->first_index = new_index;
-       if (new_index == iter->last_index) {
+       if (new_index > iter->last_index) {
                iter->is_hole = -1;
                return;
        }

Where the span iterator would not behave properly at its limit,
causing some chaos.

While thinking about iommufd_access_pin_pages(), which is the API that
trigggered all these problems, I realized we need to upgrade the
iova_alignment when an access is created to at least PAGE_SIZE if the
external driver is going to call iommufd_access_pin_pages(). Otherwise
the API can't work right because there is no way to communicate
offsets in the struct pages returned when pinning. Use of RW doesn't
require this so I made it a flag, which is convient for the test suite
that assumes odd alignments for RW testing. See below

Finally, the test suite didn't cover the unmap while access exists
flow, and the code forgot to set the ops pointer:

@@ -435,6 +425,10 @@ iommufd_access_create(struct iommufd_ctx *ictx,
u32 ioas_id,
        if (IS_ERR(access))
                return access;
 
+       access->data = data;
+       access->ops = ops;
+       access->ictx = ictx;

Jason

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index d1af0389dfab83..737897a2dcfc3c 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -26,15 +26,6 @@ struct iommufd_device {
 	bool enforce_cache_coherency;
 };
 
-struct iommufd_access {
-	struct iommufd_object obj;
-	struct iommufd_ctx *ictx;
-	struct iommufd_ioas *ioas;
-	const struct iommufd_access_ops *ops;
-	void *data;
-	u32 ioas_access_list_id;
-};
-
 void iommufd_device_destroy(struct iommufd_object *obj)
 {
 	struct iommufd_device *idev =
@@ -413,8 +404,7 @@ void iommufd_access_destroy_object(struct iommufd_object *obj)
 	struct iommufd_access *access =
 		container_of(obj, struct iommufd_access, obj);
 
-	WARN_ON(xa_erase(&access->ioas->access_list,
-			 access->ioas_access_list_id) != access);
+	iopt_remove_access(&access->ioas->iopt, access);
 	iommufd_ctx_put(access->ictx);
 	refcount_dec(&access->ioas->obj.users);
 }
@@ -435,6 +425,10 @@ iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
 	if (IS_ERR(access))
 		return access;
 
+	access->data = data;
+	access->ops = ops;
+	access->ictx = ictx;
+
 	obj = iommufd_get_object(ictx, ioas_id, IOMMUFD_OBJ_IOAS);
 	if (IS_ERR(obj)) {
 		rc = PTR_ERR(obj);
@@ -443,15 +437,16 @@ iommufd_access_create(struct iommufd_ctx *ictx, u32 ioas_id,
 	access->ioas = container_of(obj, struct iommufd_ioas, obj);
 	iommufd_put_object_keep_user(obj);
 
-	rc = xa_alloc(&access->ioas->access_list, &access->ioas_access_list_id,
-		      access, xa_limit_16b, GFP_KERNEL_ACCOUNT);
+	if (ops->needs_pin_pages)
+		access->iova_alignment = PAGE_SIZE;
+	else
+		access->iova_alignment = 1;
+	rc = iopt_add_access(&access->ioas->iopt, access);
 	if (rc)
 		goto out_put_ioas;
 
 	/* The calling driver is a user until iommufd_access_destroy() */
 	refcount_inc(&access->obj.users);
-	access->ictx = ictx;
-	access->data = data;
 	iommufd_ctx_get(ictx);
 	iommufd_object_finalize(ictx, &access->obj);
 	return access;
@@ -495,18 +490,18 @@ void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
 	struct iommufd_access *access;
 	unsigned long index;
 
-	xa_lock(&ioas->access_list);
-	xa_for_each(&ioas->access_list, index, access) {
+	xa_lock(&ioas->iopt.access_list);
+	xa_for_each(&ioas->iopt.access_list, index, access) {
 		if (!iommufd_lock_obj(&access->obj))
 			continue;
-		xa_unlock(&ioas->access_list);
+		xa_unlock(&ioas->iopt.access_list);
 
 		access->ops->unmap(access->data, iova, length);
 
 		iommufd_put_object(&access->obj);
-		xa_lock(&ioas->access_list);
+		xa_lock(&ioas->iopt.access_list);
 	}
-	xa_unlock(&ioas->access_list);
+	xa_unlock(&ioas->iopt.access_list);
 }
 
 /**
@@ -591,6 +586,11 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
 	bool first = true;
 	int rc;
 
+	/* Driver didn't specify needs_pin_pages in its ops */
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
+	    WARN_ON(access->iova_alignment != PAGE_SIZE))
+		return -EINVAL;
+
 	if (!length)
 		return -EINVAL;
 	if (check_add_overflow(iova, length - 1, &last_iova))
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index bb5cb19417d696..5d4d48c80d3ad8 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -505,7 +505,8 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
 			up_read(&iopt->domains_rwsem);
 			iommufd_access_notify_unmap(iopt, area_first,
 						    iopt_area_length(area));
-			WARN_ON(READ_ONCE(area->num_accesses));
+			if (WARN_ON(READ_ONCE(area->num_accesses)))
+				return -EDEADLOCK;
 			goto again;
 		}
 
@@ -643,6 +644,7 @@ int iopt_init_table(struct io_pagetable *iopt)
 	iopt->allowed_itree = RB_ROOT_CACHED;
 	iopt->reserved_itree = RB_ROOT_CACHED;
 	xa_init_flags(&iopt->domains, XA_FLAGS_ACCOUNT);
+	xa_init_flags(&iopt->access_list, XA_FLAGS_ALLOC);
 
 	/*
 	 * iopt's start as SW tables that can use the entire size_t IOVA space
@@ -669,6 +671,7 @@ void iopt_destroy_table(struct io_pagetable *iopt)
 
 	WARN_ON(!RB_EMPTY_ROOT(&iopt->reserved_itree.rb_root));
 	WARN_ON(!xa_empty(&iopt->domains));
+	WARN_ON(!xa_empty(&iopt->access_list));
 	WARN_ON(!RB_EMPTY_ROOT(&iopt->area_itree.rb_root));
 }
 
@@ -802,9 +805,12 @@ static int iopt_check_iova_alignment(struct io_pagetable *iopt,
 				     unsigned long new_iova_alignment)
 {
 	unsigned long align_mask = new_iova_alignment - 1;
+	struct iommufd_access *access;
 	struct iopt_area *area;
+	unsigned long index;
 
 	lockdep_assert_held(&iopt->iova_rwsem);
+	lockdep_assert_held(&iopt->domains_rwsem);
 
 	for (area = iopt_area_iter_first(iopt, 0, ULONG_MAX); area;
 	     area = iopt_area_iter_next(area, 0, ULONG_MAX))
@@ -812,6 +818,12 @@ static int iopt_check_iova_alignment(struct io_pagetable *iopt,
 		    (iopt_area_length(area) & align_mask) ||
 		    (area->page_offset & align_mask))
 			return -EADDRINUSE;
+
+	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+		xa_for_each(&iopt->access_list, index, access)
+			if (WARN_ON(access->iova_alignment >
+				    new_iova_alignment))
+				return -EADDRINUSE;
 	return 0;
 }
 
@@ -896,10 +908,12 @@ int iopt_table_add_domain(struct io_pagetable *iopt,
 static int iopt_calculate_iova_alignment(struct io_pagetable *iopt)
 {
 	unsigned long new_iova_alignment;
+	struct iommufd_access *access;
 	struct iommu_domain *domain;
 	unsigned long index;
 
 	lockdep_assert_held_write(&iopt->iova_rwsem);
+	lockdep_assert_held(&iopt->domains_rwsem);
 
 	if (iopt->disable_large_pages)
 		new_iova_alignment = PAGE_SIZE;
@@ -910,6 +924,11 @@ static int iopt_calculate_iova_alignment(struct io_pagetable *iopt)
 		new_iova_alignment = max_t(unsigned long,
 					   1UL << __ffs(domain->pgsize_bitmap),
 					   new_iova_alignment);
+	xa_for_each(&iopt->access_list, index, access)
+		new_iova_alignment = max_t(unsigned long,
+					   access->iova_alignment,
+					   new_iova_alignment);
+
 	if (new_iova_alignment > iopt->iova_alignment) {
 		int rc;
 
@@ -1106,6 +1125,41 @@ int iopt_disable_large_pages(struct io_pagetable *iopt)
 	return rc;
 }
 
+int iopt_add_access(struct io_pagetable *iopt, struct iommufd_access *access)
+{
+	int rc;
+
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	rc = xa_alloc(&iopt->access_list, &access->iopt_access_list_id, access,
+		      xa_limit_16b, GFP_KERNEL_ACCOUNT);
+	if (rc)
+		goto out_unlock;
+
+	rc = iopt_calculate_iova_alignment(iopt);
+	if (rc) {
+		xa_erase(&iopt->access_list, access->iopt_access_list_id);
+		goto out_unlock;
+	}
+
+out_unlock:
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+	return rc;
+}
+
+void iopt_remove_access(struct io_pagetable *iopt,
+			struct iommufd_access *access)
+{
+	down_write(&iopt->domains_rwsem);
+	down_write(&iopt->iova_rwsem);
+	WARN_ON(xa_erase(&iopt->access_list, access->iopt_access_list_id) !=
+		access);
+	WARN_ON(iopt_calculate_iova_alignment(iopt));
+	up_write(&iopt->iova_rwsem);
+	up_write(&iopt->domains_rwsem);
+}
+
 /* Narrow the valid_iova_itree to include reserved ranges from a group. */
 int iopt_table_enforce_group_resv_regions(struct io_pagetable *iopt,
 					  struct device *device,
diff --git a/drivers/iommu/iommufd/ioas.c b/drivers/iommu/iommufd/ioas.c
index c32a87f11c55be..068055272fc5b5 100644
--- a/drivers/iommu/iommufd/ioas.c
+++ b/drivers/iommu/iommufd/ioas.c
@@ -17,7 +17,6 @@ void iommufd_ioas_destroy(struct iommufd_object *obj)
 	rc = iopt_unmap_all(&ioas->iopt, NULL);
 	WARN_ON(rc && rc != -ENOENT);
 	iopt_destroy_table(&ioas->iopt);
-	WARN_ON(!xa_empty(&ioas->access_list));
 	mutex_destroy(&ioas->mutex);
 }
 
@@ -36,7 +35,6 @@ struct iommufd_ioas *iommufd_ioas_alloc(struct iommufd_ctx *ictx)
 
 	INIT_LIST_HEAD(&ioas->hwpt_list);
 	mutex_init(&ioas->mutex);
-	xa_init_flags(&ioas->access_list, XA_FLAGS_ALLOC);
 	return ioas;
 
 out_abort:
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 47d18388dc24fa..783fbbf0b732d4 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -34,6 +34,7 @@ struct iommufd_ctx {
 struct io_pagetable {
 	struct rw_semaphore domains_rwsem;
 	struct xarray domains;
+	struct xarray access_list;
 	unsigned int next_domain_id;
 
 	struct rw_semaphore iova_rwsem;
@@ -205,7 +206,6 @@ struct iommufd_ioas {
 	struct io_pagetable iopt;
 	struct mutex mutex;
 	struct list_head hwpt_list;
-	struct xarray access_list;
 };
 
 static inline struct iommufd_ioas *iommufd_get_ioas(struct iommufd_ucmd *ucmd,
@@ -256,10 +256,22 @@ void iommufd_hw_pagetable_destroy(struct iommufd_object *obj);
 
 void iommufd_device_destroy(struct iommufd_object *obj);
 
+struct iommufd_access {
+	struct iommufd_object obj;
+	struct iommufd_ctx *ictx;
+	struct iommufd_ioas *ioas;
+	const struct iommufd_access_ops *ops;
+	void *data;
+	unsigned long iova_alignment;
+	u32 iopt_access_list_id;
+};
+
+int iopt_add_access(struct io_pagetable *iopt, struct iommufd_access *access);
+void iopt_remove_access(struct io_pagetable *iopt,
+			struct iommufd_access *access);
 void iommufd_access_destroy_object(struct iommufd_object *obj);
 
 #ifdef CONFIG_IOMMUFD_TEST
-struct iommufd_access;
 struct iommufd_hw_pagetable *
 iommufd_device_selftest_attach(struct iommufd_ctx *ictx,
 			       struct iommufd_ioas *ioas,
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index fc253a4d2f8e77..7a5d64a1dae482 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -29,6 +29,7 @@ int iommufd_device_attach(struct iommufd_device *idev, u32 *pt_id,
 void iommufd_device_detach(struct iommufd_device *idev);
 
 struct iommufd_access_ops {
+	u8 needs_pin_pages : 1;
 	void (*unmap)(void *data, unsigned long iova, unsigned long length);
 };
Jason Gunthorpe Nov. 1, 2022, 4:09 p.m. UTC | #2
On Tue, Oct 25, 2022 at 03:12:17PM -0300, Jason Gunthorpe wrote:

> +/**
> + * iopt_unmap_domain() - Unmap without unpinning PFNs in a domain
> + * @iopt: The iopt the domain is part of
> + * @domain: The domain to unmap
> + *
> + * The caller must know that unpinning is not required, usually because there
> + * are other domains in the iopt.
> + */
> +void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain)
> +{
> +	struct interval_tree_span_iter span;
> +
> +	interval_tree_for_each_span(&span, &iopt->area_itree, 0, ULONG_MAX)
> +		if (!span.is_hole)
> +			iommu_unmap_nofail(domain, span.start_used,
> +					   span.last_used - span.start_used +
> +						   1);
> +}

Syzkaller found a splat here.

The area_itree maintains "tombstones" of areas that are allocated, and
IOVA reserved, but not yet populated in the domain. This scheme
reduces the scope of the area_itree lock to only around allocation,
and not the much slower population step. It is the basis for parallel
mapping operation.

However, when a area is tombstoned with a area->pages == NULL it also
hasn't been mapped to a domain and thus cannot be unmapped

The above span iterator is actually iterating over all the allocated
IOVAs, not the IOVAs that have been populated to a domain. To get that
we need to check area->pages before allowing the area to become a
'used'. Thus if we race domain destruction with map and hit the
tombstone we can trigger a WARN_ON in the selftest that the IOVA being
unmapped is not mapped.

A case of incorrect optimization. The fix is simple enough, just
remove the span iterator (and audit that no other uses of area_itree
are mis-using the span iterator)

diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 5d4d48c80d3ad8..d1ba31491babb0 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -709,15 +709,14 @@ static void iopt_unfill_domain(struct io_pagetable *iopt,
 				continue;
 
 			mutex_lock(&pages->mutex);
-			if (area->storage_domain != domain) {
-				mutex_unlock(&pages->mutex);
-				continue;
-			}
-			area->storage_domain = storage_domain;
+			if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
+				WARN_ON(!area->storage_domain);
+			if (area->storage_domain == domain)
+				area->storage_domain = storage_domain;
 			mutex_unlock(&pages->mutex);
-		}
 
-		iopt_unmap_domain(iopt, domain);
+			iopt_area_unmap_domain(area, domain);
+		}
 		return;
 	}
 
diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index 0bd645da97809a..3b85fa344f6be3 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -65,7 +65,8 @@ void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
 int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
 void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
 			     struct iommu_domain *domain);
-void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain);
+void iopt_area_unmap_domain(struct iopt_area *area,
+			    struct iommu_domain *domain);
 
 static inline unsigned long iopt_area_index(struct iopt_area *area)
 {
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index b15967d87ffa86..dcc4620cd5d5e6 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -1186,22 +1186,17 @@ static void iopt_area_unfill_partial_domain(struct iopt_area *area,
 }
 
 /**
- * iopt_unmap_domain() - Unmap without unpinning PFNs in a domain
- * @iopt: The iopt the domain is part of
+ * iopt_area_unmap_domain() - Unmap without unpinning PFNs in a domain
+ * @area: The IOVA range to unmap
  * @domain: The domain to unmap
  *
  * The caller must know that unpinning is not required, usually because there
  * are other domains in the iopt.
  */
-void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain)
+void iopt_area_unmap_domain(struct iopt_area *area, struct iommu_domain *domain)
 {
-	struct interval_tree_span_iter span;
-
-	interval_tree_for_each_span(&span, &iopt->area_itree, 0, ULONG_MAX)
-		if (!span.is_hole)
-			iommu_unmap_nofail(domain, span.start_used,
-					   span.last_used - span.start_used +
-						   1);
+	iommu_unmap_nofail(domain, iopt_area_iova(area),
+			   iopt_area_length(area));
 }
 
 /**
Jason Gunthorpe Nov. 3, 2022, 8:08 p.m. UTC | #3
On Tue, Oct 25, 2022 at 03:12:17PM -0300, Jason Gunthorpe wrote:
> +
> +/**
> + * iopt_area_fill_domains() - Install PFNs into the area's domains
> + * @area: The area to act on
> + * @pages: The pages associated with the area (area->pages is NULL)
> + *
> + * Called during area creation. The area is freshly created and not inserted in
> + * the domains_itree yet. PFNs are read and loaded into every domain held in the
> + * area's io_pagetable and the area is installed in the domains_itree.
> + *
> + * On failure all domains are left unchanged.
> + */
> +int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
> +{
> +	struct pfn_reader pfns;
> +	struct iommu_domain *domain;
> +	unsigned long unmap_index;
> +	unsigned long index;
> +	int rc;
> +
> +	lockdep_assert_held(&area->iopt->domains_rwsem);
> +
> +	if (xa_empty(&area->iopt->domains))
> +		return 0;
> +
> +	mutex_lock(&pages->mutex);
> +	rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
> +			      iopt_area_last_index(area));
> +	if (rc)
> +		goto out_unlock;
> +
> +	while (!pfn_reader_done(&pfns)) {
> +		xa_for_each(&area->iopt->domains, index, domain) {
> +			rc = batch_to_domain(&pfns.batch, domain, area,
> +					     pfns.batch_start_index);
> +			if (rc)
> +				goto out_unmap;
> +		}
> +
> +		rc = pfn_reader_next(&pfns);
> +		if (rc)
> +			goto out_unmap;
> +	}
> +	rc = pfn_reader_update_pinned(&pfns);
> +	if (rc)
> +		goto out_unmap;
> +
> +	area->storage_domain = xa_load(&area->iopt->domains, 0);
> +	interval_tree_insert(&area->pages_node, &pages->domains_itree);
> +	goto out_destroy;
> +
> +out_unmap:
> +	xa_for_each(&area->iopt->domains, unmap_index, domain) {
> +		unsigned long end_index = pfns.batch_start_index;
> +
> +		if (unmap_index <= index)
> +			end_index = pfns.batch_end_index;

syzkaller found that there is a typo here, it should be <

However, I wasn't able to make a quick reproduction for something that
should have a been a very reliable failure path using nth fault
injection. This led to a great big adventure where I discovered that
fault injection and xarray do not play nicely together:

https://lore.kernel.org/r/Y2QR0EDvq7p9i1xw@nvidia.com

Which ended up spending a whole bunch of time to add a nth failure
study to the test suite and understand what is going on and how to
make it work better. It now covers this scenario deterministically.

The exhaustive nth failure study also shows this error handling has
another, more serious, problem. We keep track of how many pages have
been pinned inside the pages, and we also keep track of the last
charge to the rlimit/etc. At the end of operations these are
reconciled. There are lots of assertions checking that this is being
tracked properly so that we don't loose track of a pinned page in the
very complicated logic.

The new test suite discovered this missing:

 		/* Any pages not transferred to the batch are just unpinned */
 		unpin_user_pages(pfns->user.upages + (pfns->batch_end_index -
 						      pfns->user.upages_start),
 				 npages);
+		iopt_pages_sub_npinned(pages, npages);

We need to charge back the as-yet-unprocessed pages we are unpinning
when destroying the batch.

And then we get into trouble that things are not happening in the
order the assertions would like as this:

> +			iopt_area_unfill_partial_domain(area, pages, domain,
> +							end_index);

Demands that the pinned page charge during unfilling be only
decreasing, never increasing. However we can still be holding pinned
pages in the batch at this instant that don't get cleaned up until:

> +		}
> +	}
> +out_destroy:
> +	pfn_reader_destroy(&pfns);

Here

Thus the assertions get unhappy.

Introducing a pfn_reader_release_pins() which is called before
unfilling gets everything in the right order and the testing of these
two functions now passes, though I still have to insert a few more
error injection points to get full coverage.

Syzkaller has found another 4 things I still have to look at and is
now sitting at 65%(72%) coverage. So steadily progressing..

Jason

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 245e7b96902107..ce707d6f5ee959 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -994,7 +994,15 @@ static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
 	return 0;
 }
 
-static void pfn_reader_destroy(struct pfn_reader *pfns)
+/*
+ * There are many assertions regarding the state of pages->npinned vs
+ * pages->last_pinned, for instance something like unmapping a domain must only
+ * decrement the npinned, and pfn_reader_destroy() must be called only after all
+ * the pins are updated. This is fine for success flows, but error flows
+ * sometimes need to release the pins held inside the pfn_reader before going on
+ * to complete unmapping and releasing pins held in domains.
+ */
+static void pfn_reader_release_pins(struct pfn_reader *pfns)
 {
 	struct iopt_pages *pages = pfns->pages;
 
@@ -1005,12 +1013,20 @@ static void pfn_reader_destroy(struct pfn_reader *pfns)
 		unpin_user_pages(pfns->user.upages + (pfns->batch_end_index -
 						      pfns->user.upages_start),
 				 npages);
+		iopt_pages_sub_npinned(pages, npages);
+		pfns->user.upages_end = pfns->batch_end_index;
 	}
-
-	pfn_reader_user_destroy(&pfns->user, pfns->pages);
-
 	if (pfns->batch_start_index != pfns->batch_end_index)
 		pfn_reader_unpin(pfns);
+	pfns->batch_start_index = pfns->batch_end_index;
+}
+
+static void pfn_reader_destroy(struct pfn_reader *pfns)
+{
+	struct iopt_pages *pages = pfns->pages;
+
+	pfn_reader_release_pins(pfns);
+	pfn_reader_user_destroy(&pfns->user, pfns->pages);
 	batch_destroy(&pfns->batch, NULL);
 	WARN_ON(pages->last_npinned != pages->npinned);
 }
@@ -1223,6 +1239,7 @@ void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
  */
 int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 {
+	unsigned long done_end_index;
 	struct pfn_reader pfns;
 	int rc;
 
@@ -1234,10 +1251,12 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 		return rc;
 
 	while (!pfn_reader_done(&pfns)) {
+		done_end_index = pfns.batch_start_index;
 		rc = batch_to_domain(&pfns.batch, domain, area,
 				     pfns.batch_start_index);
 		if (rc)
 			goto out_unmap;
+		done_end_index = pfns.batch_end_index;
 
 		rc = pfn_reader_next(&pfns);
 		if (rc)
@@ -1250,8 +1269,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 	goto out_destroy;
 
 out_unmap:
+	pfn_reader_release_pins(&pfns);
 	iopt_area_unfill_partial_domain(area, area->pages, domain,
-					pfns.batch_start_index);
+					done_end_index);
 out_destroy:
 	pfn_reader_destroy(&pfns);
 	return rc;
@@ -1270,9 +1290,11 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
  */
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 {
-	struct pfn_reader pfns;
+	unsigned long done_first_end_index;
+	unsigned long done_all_end_index;
 	struct iommu_domain *domain;
 	unsigned long unmap_index;
+	struct pfn_reader pfns;
 	unsigned long index;
 	int rc;
 
@@ -1288,12 +1310,15 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 		goto out_unlock;
 
 	while (!pfn_reader_done(&pfns)) {
+		done_first_end_index = pfns.batch_end_index;
+		done_all_end_index = pfns.batch_start_index;
 		xa_for_each(&area->iopt->domains, index, domain) {
 			rc = batch_to_domain(&pfns.batch, domain, area,
 					     pfns.batch_start_index);
 			if (rc)
 				goto out_unmap;
 		}
+		done_all_end_index = done_first_end_index;
 
 		rc = pfn_reader_next(&pfns);
 		if (rc)
@@ -1308,11 +1333,14 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 	goto out_destroy;
 
 out_unmap:
+	pfn_reader_release_pins(&pfns);
 	xa_for_each(&area->iopt->domains, unmap_index, domain) {
-		unsigned long end_index = pfns.batch_start_index;
+		unsigned long end_index;
 
-		if (unmap_index <= index)
-			end_index = pfns.batch_end_index;
+		if (unmap_index < index)
+			end_index = done_first_end_index;
+		else
+			end_index = done_all_end_index;
 
 		/*
 		 * The area is not yet part of the domains_itree so we have to
Jason Gunthorpe Nov. 4, 2022, 4:04 p.m. UTC | #4
On Tue, Oct 25, 2022 at 03:12:17PM -0300, Jason Gunthorpe wrote:
> +/**
> + * iopt_pages_unfill_xarray() - Update the xarry after removing an access
> + * @pages: The pages to act on
> + * @start_index: Starting PFN index
> + * @last_index: Last PFN index
> + *
> + * Called when an iopt_pages_access is removed, removes pages from the itree.
> + * The access should already be removed from the access_itree.
> + */
> +void iopt_pages_unfill_xarray(struct iopt_pages *pages,
> +			      unsigned long start_index,
> +			      unsigned long last_index)
> +{
> +	struct interval_tree_double_span_iter span;
> +	u64 backup[BATCH_BACKUP_SIZE];
> +	struct pfn_batch batch;
> +	bool batch_inited = false;
> +
> +	lockdep_assert_held(&pages->mutex);
> +
> +	interval_tree_for_each_double_span(&span, &pages->access_itree,
> +					   &pages->domains_itree, start_index,
> +					   last_index) {
> +		if (!span.is_used) {
> +			if (!batch_inited) {
> +				batch_init_backup(&batch,
> +						  last_index - start_index + 1,
> +						  backup, sizeof(backup));
> +				batch_inited = true;
> +			}
> +			batch_from_xarray_clear(&batch, &pages->pinned_pfns,
> +						span.start_hole,
> +						span.last_hole);
> +			batch_unpin(&batch, pages, 0, batch.total_pfns);
> +			batch_clear(&batch);

Syzkaller, and then the nth test suite found this - the required loop
around the 'batch_from' is missing. The normal test suite should have
found this, but it turns out it is using huge pages, and not enough of
them to overfill even the limited batch.

Testing also showed that batch overflow accidently zerod an xarray
entry that was not actually stored. Thus this:

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index da95b28b41772c..3640f83f57069b 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -479,10 +479,11 @@ static void batch_from_xarray_clear(struct pfn_batch *batch, struct xarray *xa,
 		entry = xas_next(&xas);
 		if (xas_retry(&xas, entry))
 			continue;
-		xas_store(&xas, NULL);
 		WARN_ON(!xa_is_value(entry));
-		if (!batch_add_pfn(batch, xa_to_value(entry)) ||
-		    start_index == last_index)
+		if (!batch_add_pfn(batch, xa_to_value(entry)))
+			break;
+		xas_store(&xas, NULL);
+		if (start_index == last_index)
 			break;
 		start_index++;
 	}
@@ -1407,6 +1408,20 @@ void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
 	mutex_unlock(&pages->mutex);
 }
 
+static void iopt_pages_unpin_xarray(struct pfn_batch *batch,
+				    struct iopt_pages *pages,
+				    unsigned long start_index,
+				    unsigned long end_index)
+{
+	while (start_index <= end_index) {
+		batch_from_xarray_clear(batch, &pages->pinned_pfns, start_index,
+					end_index);
+		batch_unpin(batch, pages, 0, batch->total_pfns);
+		start_index += batch->total_pfns;
+		batch_clear(batch);
+	}
+}
+
 /**
  * iopt_pages_unfill_xarray() - Update the xarry after removing an access
  * @pages: The pages to act on
@@ -1437,11 +1452,8 @@ void iopt_pages_unfill_xarray(struct iopt_pages *pages,
 						  backup, sizeof(backup));
 				batch_inited = true;
 			}
-			batch_from_xarray_clear(&batch, &pages->pinned_pfns,
-						span.start_hole,
+			iopt_pages_unpin_xarray(&batch, pages, span.start_hole,
 						span.last_hole);
-			batch_unpin(&batch, pages, 0, batch.total_pfns);
-			batch_clear(&batch);
 		} else if (span.is_used == 2) {
 			/* Covered by a domain */
 			clear_xarray(&pages->pinned_pfns, span.start_used,
Jason Gunthorpe Nov. 4, 2022, 4:26 p.m. UTC | #5
On Thu, Nov 03, 2022 at 05:08:08PM -0300, Jason Gunthorpe wrote:
> +static void pfn_reader_release_pins(struct pfn_reader *pfns)
>  {
>  	struct iopt_pages *pages = pfns->pages;
>  
> @@ -1005,12 +1013,20 @@ static void pfn_reader_destroy(struct pfn_reader *pfns)
>  		unpin_user_pages(pfns->user.upages + (pfns->batch_end_index -
>  						      pfns->user.upages_start),
>  				 npages);
> +		iopt_pages_sub_npinned(pages, npages);
> +		pfns->user.upages_end = pfns->batch_end_index;
>  	}

Syzkaller says the hidden if above is no good on error paths where the
pfn_reader has already advanced - it makes npages go negative:

-       if (pfns->user.upages) {
+       if (pfns->user.upages_end > pfns->batch_end_index) {

Jason
diff mbox series

Patch

diff --git a/drivers/iommu/iommufd/io_pagetable.h b/drivers/iommu/iommufd/io_pagetable.h
index b74bf01ffc52c2..083451afcdcf46 100644
--- a/drivers/iommu/iommufd/io_pagetable.h
+++ b/drivers/iommu/iommufd/io_pagetable.h
@@ -49,6 +49,14 @@  struct iopt_area {
 	unsigned int num_accesses;
 };
 
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages);
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages);
+
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain);
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain);
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain);
+
 static inline unsigned long iopt_area_index(struct iopt_area *area)
 {
 	return area->pages_node.start;
@@ -69,6 +77,39 @@  static inline unsigned long iopt_area_last_iova(struct iopt_area *area)
 	return area->node.last;
 }
 
+static inline size_t iopt_area_length(struct iopt_area *area)
+{
+	return (area->node.last - area->node.start) + 1;
+}
+
+#define __make_iopt_iter(name)                                                 \
+	static inline struct iopt_##name *iopt_##name##_iter_first(            \
+		struct io_pagetable *iopt, unsigned long start,                \
+		unsigned long last)                                            \
+	{                                                                      \
+		struct interval_tree_node *node;                               \
+									       \
+		lockdep_assert_held(&iopt->iova_rwsem);                        \
+		node = interval_tree_iter_first(&iopt->name##_itree, start,    \
+						last);                         \
+		if (!node)                                                     \
+			return NULL;                                           \
+		return container_of(node, struct iopt_##name, node);           \
+	}                                                                      \
+	static inline struct iopt_##name *iopt_##name##_iter_next(             \
+		struct iopt_##name *last_node, unsigned long start,            \
+		unsigned long last)                                            \
+	{                                                                      \
+		struct interval_tree_node *node;                               \
+									       \
+		node = interval_tree_iter_next(&last_node->node, start, last); \
+		if (!node)                                                     \
+			return NULL;                                           \
+		return container_of(node, struct iopt_##name, node);           \
+	}
+
+__make_iopt_iter(area)
+
 enum {
 	IOPT_PAGES_ACCOUNT_NONE = 0,
 	IOPT_PAGES_ACCOUNT_USER = 1,
@@ -106,4 +147,36 @@  struct iopt_pages {
 	struct rb_root_cached domains_itree;
 };
 
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable);
+void iopt_release_pages(struct kref *kref);
+static inline void iopt_put_pages(struct iopt_pages *pages)
+{
+	kref_put(&pages->kref, iopt_release_pages);
+}
+
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages, unsigned long start,
+				 unsigned long last, struct page **out_pages);
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start,
+			   unsigned long last, struct page **out_pages);
+void iopt_pages_unfill_xarray(struct iopt_pages *pages, unsigned long start,
+			      unsigned long last);
+
+int iopt_pages_add_access(struct iopt_pages *pages, unsigned long start,
+			unsigned long last, struct page **out_pages,
+			unsigned int flags);
+void iopt_pages_remove_access(struct iopt_area *area, unsigned long start,
+			    unsigned long last);
+int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
+			 void *data, unsigned long length, unsigned int flags);
+
+/*
+ * Each interval represents an active iopt_access_pages(), it acts as an
+ * interval lock that keeps the PFNs pinned and stored in the xarray.
+ */
+struct iopt_pages_access {
+	struct interval_tree_node node;
+	refcount_t refcount;
+};
+
 #endif
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index dd74aea60d2cc3..a09e197524be3b 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -162,6 +162,18 @@  static void iommu_unmap_nofail(struct iommu_domain *domain, unsigned long iova,
 	WARN_ON(ret != size);
 }
 
+static void iopt_area_unmap_domain_range(struct iopt_area *area,
+					 struct iommu_domain *domain,
+					 unsigned long start_index,
+					 unsigned long last_index)
+{
+	unsigned long start_iova = iopt_area_index_to_iova(area, start_index);
+
+	iommu_unmap_nofail(domain, start_iova,
+			   iopt_area_index_to_iova_last(area, last_index) -
+				   start_iova + 1);
+}
+
 static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
 						     unsigned long index)
 {
@@ -182,7 +194,7 @@  static struct iopt_area *iopt_pages_find_domain_area(struct iopt_pages *pages,
  */
 struct pfn_batch {
 	unsigned long *pfns;
-	u16 *npfns;
+	u32 *npfns;
 	unsigned int array_size;
 	unsigned int end;
 	unsigned int total_pfns;
@@ -232,7 +244,7 @@  static int __batch_init(struct pfn_batch *batch, size_t max_pages, void *backup,
 	if (!batch->pfns)
 		return -ENOMEM;
 	batch->array_size = size / elmsz;
-	batch->npfns = (u16 *)(batch->pfns + batch->array_size);
+	batch->npfns = (u32 *)(batch->pfns + batch->array_size);
 	batch_clear(batch);
 	return 0;
 }
@@ -257,10 +269,11 @@  static void batch_destroy(struct pfn_batch *batch, void *backup)
 /* true if the pfn could be added, false otherwise */
 static bool batch_add_pfn(struct pfn_batch *batch, unsigned long pfn)
 {
-	/* FIXME: U16 is too small */
+	const unsigned int MAX_NPFNS = type_max(typeof(*batch->npfns));
+
 	if (batch->end &&
 	    pfn == batch->pfns[batch->end - 1] + batch->npfns[batch->end - 1] &&
-	    batch->npfns[batch->end - 1] != U16_MAX) {
+	    batch->npfns[batch->end - 1] != MAX_NPFNS) {
 		batch->npfns[batch->end - 1]++;
 		batch->total_pfns++;
 		return true;
@@ -994,3 +1007,785 @@  static int pfn_reader_first(struct pfn_reader *pfns, struct iopt_pages *pages,
 	}
 	return 0;
 }
+
+struct iopt_pages *iopt_alloc_pages(void __user *uptr, unsigned long length,
+				    bool writable)
+{
+	struct iopt_pages *pages;
+
+	/*
+	 * The iommu API uses size_t as the length, and protect the DIV_ROUND_UP
+	 * below from overflow
+	 */
+	if (length > SIZE_MAX - PAGE_SIZE || length == 0)
+		return ERR_PTR(-EINVAL);
+
+	pages = kzalloc(sizeof(*pages), GFP_KERNEL_ACCOUNT);
+	if (!pages)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&pages->kref);
+	xa_init_flags(&pages->pinned_pfns, XA_FLAGS_ACCOUNT);
+	mutex_init(&pages->mutex);
+	pages->source_mm = current->mm;
+	mmgrab(pages->source_mm);
+	pages->uptr = (void __user *)ALIGN_DOWN((uintptr_t)uptr, PAGE_SIZE);
+	pages->npages = DIV_ROUND_UP(length + (uptr - pages->uptr), PAGE_SIZE);
+	pages->access_itree = RB_ROOT_CACHED;
+	pages->domains_itree = RB_ROOT_CACHED;
+	pages->writable = writable;
+	if (capable(CAP_IPC_LOCK))
+		pages->account_mode = IOPT_PAGES_ACCOUNT_NONE;
+	else
+		pages->account_mode = IOPT_PAGES_ACCOUNT_USER;
+	pages->source_task = current->group_leader;
+	get_task_struct(current->group_leader);
+	pages->source_user = get_uid(current_user());
+	return pages;
+}
+
+void iopt_release_pages(struct kref *kref)
+{
+	struct iopt_pages *pages = container_of(kref, struct iopt_pages, kref);
+
+	WARN_ON(!RB_EMPTY_ROOT(&pages->access_itree.rb_root));
+	WARN_ON(!RB_EMPTY_ROOT(&pages->domains_itree.rb_root));
+	WARN_ON(pages->npinned);
+	WARN_ON(!xa_empty(&pages->pinned_pfns));
+	mmdrop(pages->source_mm);
+	mutex_destroy(&pages->mutex);
+	put_task_struct(pages->source_task);
+	free_uid(pages->source_user);
+	kfree(pages);
+}
+
+static void
+iopt_area_unpin_domain(struct pfn_batch *batch, struct iopt_area *area,
+		       struct iopt_pages *pages, struct iommu_domain *domain,
+		       unsigned long start_index, unsigned long last_index,
+		       unsigned long *unmapped_end_index,
+		       unsigned long real_last_index)
+{
+	while (start_index <= last_index) {
+		unsigned long batch_last_index;
+
+		if (*unmapped_end_index <= last_index) {
+			unsigned long start =
+				max(start_index, *unmapped_end_index);
+
+			batch_from_domain(batch, domain, area, start,
+					  last_index);
+			batch_last_index = start + batch->total_pfns - 1;
+		} else {
+			batch_last_index = last_index;
+		}
+
+		/*
+		 * unmaps must always 'cut' at a place where the pfns are not
+		 * contiguous to pair with the maps that always install
+		 * contiguous pages. Thus, if we have to stop unpinning in the
+		 * middle of the domains we need to keep reading pfns until we
+		 * find a cut point to do the unmap. The pfns we read are
+		 * carried over and either skipped or integrated into the next
+		 * batch.
+		 */
+		if (batch_last_index == last_index &&
+		    last_index != real_last_index)
+			batch_from_domain_continue(batch, domain, area,
+						   last_index + 1,
+						   real_last_index);
+
+		if (*unmapped_end_index <= batch_last_index) {
+			iopt_area_unmap_domain_range(
+				area, domain, *unmapped_end_index,
+				start_index + batch->total_pfns - 1);
+			*unmapped_end_index = start_index + batch->total_pfns;
+		}
+
+		/* unpin must follow unmap */
+		batch_unpin(batch, pages, 0,
+			    batch_last_index - start_index + 1);
+		start_index = batch_last_index + 1;
+
+		batch_clear_carry(batch,
+				  *unmapped_end_index - batch_last_index - 1);
+	}
+}
+
+static void __iopt_area_unfill_domain(struct iopt_area *area,
+				      struct iopt_pages *pages,
+				      struct iommu_domain *domain,
+				      unsigned long last_index)
+{
+	struct interval_tree_double_span_iter span;
+	unsigned long start_index = iopt_area_index(area);
+	unsigned long unmapped_end_index = start_index;
+	u64 backup[BATCH_BACKUP_SIZE];
+	struct pfn_batch batch;
+
+	lockdep_assert_held(&pages->mutex);
+
+	batch_init_backup(&batch, last_index + 1, backup, sizeof(backup));
+	interval_tree_for_each_double_span(&span, &pages->domains_itree,
+					   &pages->access_itree, start_index,
+					   last_index) {
+		if (span.is_used) {
+			batch_skip_carry(&batch,
+					 span.last_used - span.start_used + 1);
+			continue;
+		}
+		iopt_area_unpin_domain(&batch, area, pages, domain,
+				       span.start_hole, span.last_hole,
+				       &unmapped_end_index, last_index);
+	}
+	if (unmapped_end_index != last_index + 1)
+		iopt_area_unmap_domain_range(area, domain, unmapped_end_index,
+					     last_index);
+	WARN_ON(batch.total_pfns);
+	batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+static void iopt_area_unfill_partial_domain(struct iopt_area *area,
+					    struct iopt_pages *pages,
+					    struct iommu_domain *domain,
+					    unsigned long end_index)
+{
+	if (end_index != iopt_area_index(area))
+		__iopt_area_unfill_domain(area, pages, domain, end_index - 1);
+}
+
+/**
+ * iopt_unmap_domain() - Unmap without unpinning PFNs in a domain
+ * @iopt: The iopt the domain is part of
+ * @domain: The domain to unmap
+ *
+ * The caller must know that unpinning is not required, usually because there
+ * are other domains in the iopt.
+ */
+void iopt_unmap_domain(struct io_pagetable *iopt, struct iommu_domain *domain)
+{
+	struct interval_tree_span_iter span;
+
+	interval_tree_for_each_span(&span, &iopt->area_itree, 0, ULONG_MAX)
+		if (!span.is_hole)
+			iommu_unmap_nofail(domain, span.start_used,
+					   span.last_used - span.start_used +
+						   1);
+}
+
+/**
+ * iopt_area_unfill_domain() - Unmap and unpin PFNs in a domain
+ * @area: IOVA area to use
+ * @pages: page supplier for the area (area->pages is NULL)
+ * @domain: Domain to unmap from
+ *
+ * The domain should be removed from the domains_itree before calling. The
+ * domain will always be unmapped, but the PFNs may not be unpinned if there are
+ * still accesses.
+ */
+void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
+			     struct iommu_domain *domain)
+{
+	__iopt_area_unfill_domain(area, pages, domain,
+				  iopt_area_last_index(area));
+}
+
+/**
+ * iopt_area_fill_domain() - Map PFNs from the area into a domain
+ * @area: IOVA area to use
+ * @domain: Domain to load PFNs into
+ *
+ * Read the pfns from the area's underlying iopt_pages and map them into the
+ * given domain. Called when attaching a new domain to an io_pagetable.
+ */
+int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
+{
+	struct pfn_reader pfns;
+	int rc;
+
+	lockdep_assert_held(&area->pages->mutex);
+
+	rc = pfn_reader_first(&pfns, area->pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		return rc;
+
+	while (!pfn_reader_done(&pfns)) {
+		rc = batch_to_domain(&pfns.batch, domain, area,
+				     pfns.batch_start_index);
+		if (rc)
+			goto out_unmap;
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+
+	rc = pfn_reader_update_pinned(&pfns);
+	if (rc)
+		goto out_unmap;
+	goto out_destroy;
+
+out_unmap:
+	iopt_area_unfill_partial_domain(area, area->pages, domain,
+					pfns.batch_start_index);
+out_destroy:
+	pfn_reader_destroy(&pfns);
+	return rc;
+}
+
+/**
+ * iopt_area_fill_domains() - Install PFNs into the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area creation. The area is freshly created and not inserted in
+ * the domains_itree yet. PFNs are read and loaded into every domain held in the
+ * area's io_pagetable and the area is installed in the domains_itree.
+ *
+ * On failure all domains are left unchanged.
+ */
+int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct pfn_reader pfns;
+	struct iommu_domain *domain;
+	unsigned long unmap_index;
+	unsigned long index;
+	int rc;
+
+	lockdep_assert_held(&area->iopt->domains_rwsem);
+
+	if (xa_empty(&area->iopt->domains))
+		return 0;
+
+	mutex_lock(&pages->mutex);
+	rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
+			      iopt_area_last_index(area));
+	if (rc)
+		goto out_unlock;
+
+	while (!pfn_reader_done(&pfns)) {
+		xa_for_each(&area->iopt->domains, index, domain) {
+			rc = batch_to_domain(&pfns.batch, domain, area,
+					     pfns.batch_start_index);
+			if (rc)
+				goto out_unmap;
+		}
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_unmap;
+	}
+	rc = pfn_reader_update_pinned(&pfns);
+	if (rc)
+		goto out_unmap;
+
+	area->storage_domain = xa_load(&area->iopt->domains, 0);
+	interval_tree_insert(&area->pages_node, &pages->domains_itree);
+	goto out_destroy;
+
+out_unmap:
+	xa_for_each(&area->iopt->domains, unmap_index, domain) {
+		unsigned long end_index = pfns.batch_start_index;
+
+		if (unmap_index <= index)
+			end_index = pfns.batch_end_index;
+
+		/*
+		 * The area is not yet part of the domains_itree so we have to
+		 * manage the unpinning specially. The last domain does the
+		 * unpin, every other domain is just unmapped.
+		 */
+		if (unmap_index != area->iopt->next_domain_id - 1) {
+			if (end_index != iopt_area_index(area))
+				iopt_area_unmap_domain_range(
+					area, domain, iopt_area_index(area),
+					end_index - 1);
+		} else {
+			iopt_area_unfill_partial_domain(area, pages, domain,
+							end_index);
+		}
+	}
+out_destroy:
+	pfn_reader_destroy(&pfns);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/**
+ * iopt_area_unfill_domains() - unmap PFNs from the area's domains
+ * @area: The area to act on
+ * @pages: The pages associated with the area (area->pages is NULL)
+ *
+ * Called during area destruction. This unmaps the iova's covered by all the
+ * area's domains and releases the PFNs.
+ */
+void iopt_area_unfill_domains(struct iopt_area *area, struct iopt_pages *pages)
+{
+	struct io_pagetable *iopt = area->iopt;
+	struct iommu_domain *domain;
+	unsigned long index;
+
+	lockdep_assert_held(&iopt->domains_rwsem);
+
+	mutex_lock(&pages->mutex);
+	if (!area->storage_domain)
+		goto out_unlock;
+
+	xa_for_each(&iopt->domains, index, domain)
+		if (domain != area->storage_domain)
+			iopt_area_unmap_domain_range(
+				area, domain, iopt_area_index(area),
+				iopt_area_last_index(area));
+
+	interval_tree_remove(&area->pages_node, &pages->domains_itree);
+	iopt_area_unfill_domain(area, pages, area->storage_domain);
+	area->storage_domain = NULL;
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}
+
+/**
+ * iopt_pages_unfill_xarray() - Update the xarry after removing an access
+ * @pages: The pages to act on
+ * @start_index: Starting PFN index
+ * @last_index: Last PFN index
+ *
+ * Called when an iopt_pages_access is removed, removes pages from the itree.
+ * The access should already be removed from the access_itree.
+ */
+void iopt_pages_unfill_xarray(struct iopt_pages *pages,
+			      unsigned long start_index,
+			      unsigned long last_index)
+{
+	struct interval_tree_double_span_iter span;
+	u64 backup[BATCH_BACKUP_SIZE];
+	struct pfn_batch batch;
+	bool batch_inited = false;
+
+	lockdep_assert_held(&pages->mutex);
+
+	interval_tree_for_each_double_span(&span, &pages->access_itree,
+					   &pages->domains_itree, start_index,
+					   last_index) {
+		if (!span.is_used) {
+			if (!batch_inited) {
+				batch_init_backup(&batch,
+						  last_index - start_index + 1,
+						  backup, sizeof(backup));
+				batch_inited = true;
+			}
+			batch_from_xarray_clear(&batch, &pages->pinned_pfns,
+						span.start_hole,
+						span.last_hole);
+			batch_unpin(&batch, pages, 0, batch.total_pfns);
+			batch_clear(&batch);
+		} else if (span.is_used == 2) {
+			/* Covered by a domain */
+			clear_xarray(&pages->pinned_pfns, span.start_used,
+				     span.last_used);
+		}
+		/* Otherwise covered by an existing access */
+	}
+	if (batch_inited)
+		batch_destroy(&batch, backup);
+	update_unpinned(pages);
+}
+
+/**
+ * iopt_pages_fill_from_xarray() - Fast path for reading PFNs
+ * @pages: The pages to act on
+ * @start_index: The first page index in the range
+ * @last_index: The last page index in the range
+ * @out_pages: The output array to return the pages
+ *
+ * This can be called if the caller is holding a refcount on an
+ * iopt_pages_access that is known to have already been filled. It quickly reads
+ * the pages directly from the xarray.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+void iopt_pages_fill_from_xarray(struct iopt_pages *pages,
+				 unsigned long start_index,
+				 unsigned long last_index,
+				 struct page **out_pages)
+{
+	XA_STATE(xas, &pages->pinned_pfns, start_index);
+	void *entry;
+
+	rcu_read_lock();
+	while (start_index <= last_index) {
+		entry = xas_next(&xas);
+		if (xas_retry(&xas, entry))
+			continue;
+		WARN_ON(!xa_is_value(entry));
+		*(out_pages++) = pfn_to_page(xa_to_value(entry));
+		start_index++;
+	}
+	rcu_read_unlock();
+}
+
+static int iopt_pages_fill_from_domain(struct iopt_pages *pages,
+				       unsigned long start_index,
+				       unsigned long last_index,
+				       struct page **out_pages)
+{
+	while (start_index != last_index + 1) {
+		unsigned long domain_last;
+		struct iopt_area *area;
+
+		area = iopt_pages_find_domain_area(pages, start_index);
+		if (WARN_ON(!area))
+			return -EINVAL;
+
+		domain_last = min(iopt_area_last_index(area), last_index);
+		out_pages = raw_pages_from_domain(area->storage_domain, area,
+						  start_index, domain_last,
+						  out_pages);
+		start_index = domain_last + 1;
+	}
+	return 0;
+}
+
+static int iopt_pages_fill_from_mm(struct iopt_pages *pages,
+				   struct pfn_reader_user *user,
+				   unsigned long start_index,
+				   unsigned long last_index,
+				   struct page **out_pages)
+{
+	unsigned long cur_index = start_index;
+	int rc;
+
+	while (cur_index != last_index + 1) {
+		user->upages = out_pages + (cur_index - start_index);
+		rc = pfn_reader_user_pin(user, pages, cur_index, last_index);
+		if (rc)
+			goto out_unpin;
+		cur_index = user->upages_end;
+	}
+	return 0;
+
+out_unpin:
+	if (start_index != cur_index)
+		iopt_pages_err_unpin(pages, start_index, cur_index - 1,
+				     out_pages);
+	return rc;
+}
+
+/**
+ * iopt_pages_fill_xarray() - Read PFNs
+ * @pages: The pages to act on
+ * @start_index: The first page index in the range
+ * @last_index: The last page index in the range
+ * @out_pages: The output array to return the pages, may be NULL
+ *
+ * This populates the xarray and returns the pages in out_pages. As the slow
+ * path this is able to copy pages from other storage tiers into the xarray.
+ *
+ * On failure the xarray is left unchanged.
+ *
+ * This is part of the SW iommu interface to read pages for in-kernel use.
+ */
+int iopt_pages_fill_xarray(struct iopt_pages *pages, unsigned long start_index,
+			   unsigned long last_index, struct page **out_pages)
+{
+	struct interval_tree_double_span_iter span;
+	unsigned long xa_end = start_index;
+	struct pfn_reader_user user;
+	int rc;
+
+	pfn_reader_user_init(&user, pages);
+	user.upages_len = last_index - start_index + 1;
+	interval_tree_for_each_double_span(&span, &pages->access_itree,
+					   &pages->domains_itree, start_index,
+					   last_index) {
+		struct page **cur_pages;
+
+		if (span.is_used == 1) {
+			cur_pages = out_pages + (span.start_used - start_index);
+			iopt_pages_fill_from_xarray(pages, span.start_used,
+						    span.last_used, cur_pages);
+			continue;
+		}
+
+		if (span.is_used == 2) {
+			cur_pages = out_pages + (span.start_used - start_index);
+			iopt_pages_fill_from_domain(pages, span.start_used,
+						    span.last_used, cur_pages);
+			rc = pages_to_xarray(&pages->pinned_pfns,
+					     span.start_used, span.last_used,
+					     cur_pages);
+			if (rc)
+				goto out_clean_xa;
+			xa_end = span.last_used + 1;
+			continue;
+		}
+
+		/* hole */
+		cur_pages = out_pages + (span.start_hole - start_index);
+		rc = iopt_pages_fill_from_mm(pages, &user, span.start_hole,
+					     span.last_hole, cur_pages);
+		if (rc)
+			goto out_clean_xa;
+		rc = pages_to_xarray(&pages->pinned_pfns, span.start_hole,
+				     span.last_hole, cur_pages);
+		if (rc) {
+			iopt_pages_err_unpin(pages, span.start_hole,
+					     span.last_hole, cur_pages);
+			goto out_clean_xa;
+		}
+		xa_end = span.last_hole + 1;
+	}
+	rc = pfn_reader_user_update_pinned(&user, pages);
+	if (rc)
+		goto out_clean_xa;
+	user.upages = NULL;
+	pfn_reader_user_destroy(&user, pages);
+	return 0;
+
+out_clean_xa:
+	if (start_index != xa_end)
+		iopt_pages_unfill_xarray(pages, start_index, xa_end - 1);
+	user.upages = NULL;
+	pfn_reader_user_destroy(&user, pages);
+	return rc;
+}
+
+/*
+ * This can do everything and is fully coherent with what a iommu_domain would
+ * see.
+ */
+static int iopt_pages_rw_slow(struct iopt_pages *pages,
+			      unsigned long start_index,
+			      unsigned long last_index, unsigned long offset,
+			      void *data, unsigned long length,
+			      unsigned int flags)
+{
+	struct pfn_reader pfns;
+	int rc;
+
+	mutex_lock(&pages->mutex);
+
+	rc = pfn_reader_first(&pfns, pages, start_index, last_index);
+	if (rc)
+		goto out_unlock;
+
+	while (!pfn_reader_done(&pfns)) {
+		unsigned long done;
+
+		done = batch_rw(&pfns.batch, data, offset, length, flags);
+		data += done;
+		length -= done;
+		offset = 0;
+		pfn_reader_unpin(&pfns);
+
+		rc = pfn_reader_next(&pfns);
+		if (rc)
+			goto out_destroy;
+	}
+	if (WARN_ON(length != 0))
+		rc = -EINVAL;
+out_destroy:
+	pfn_reader_destroy(&pfns);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+	return rc;
+}
+
+/*
+ * A medium speed path that still allows DMA decoherence, but doesn't do any
+ * memory allocations or interval tree searches.
+ */
+static int iopt_pages_rw_page(struct iopt_pages *pages, unsigned long index,
+			      unsigned long offset, void *data,
+			      unsigned long length, unsigned int flags)
+{
+	struct page *page = NULL;
+	int rc;
+
+	if (!mmget_not_zero(pages->source_mm))
+		return iopt_pages_rw_slow(pages, index, index, offset, data,
+					  length, flags);
+
+	mmap_read_lock(pages->source_mm);
+	rc = pin_user_pages_remote(
+		pages->source_mm, (uintptr_t)(pages->uptr + index * PAGE_SIZE),
+		1, (flags & IOMMUFD_ACCESS_RW_WRITE) ? FOLL_WRITE : 0, &page,
+		NULL, NULL);
+	mmap_read_unlock(pages->source_mm);
+	if (rc != 1) {
+		if (WARN_ON(rc >= 0))
+			rc = -EINVAL;
+		goto out_mmput;
+	}
+	copy_data_page(page, data, offset, length, flags);
+	unpin_user_page(page);
+	rc = 0;
+
+out_mmput:
+	mmput(pages->source_mm);
+	return rc;
+}
+
+/**
+ * iopt_pages_rw_access - Copy to/from a linear slice of the pages
+ * @pages: pages to act on
+ * @start_byte: First byte of pages to copy to/from
+ * @data: Kernel buffer to get/put the data
+ * @length: Number of bytes to copy
+ * @flags: IOMMUFD_ACCESS_RW_* flags
+ *
+ * This will find each page in the range, kmap it and then memcpy to/from
+ * the given kernel buffer.
+ */
+int iopt_pages_rw_access(struct iopt_pages *pages, unsigned long start_byte,
+			 void *data, unsigned long length, unsigned int flags)
+{
+	unsigned long start_index = start_byte / PAGE_SIZE;
+	unsigned long last_index = (start_byte + length - 1) / PAGE_SIZE;
+	bool change_mm = current->mm != pages->source_mm;
+	int rc = 0;
+
+	if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
+		return -EPERM;
+
+	if (!(flags & IOMMUFD_ACCESS_RW_KTHREAD) && change_mm) {
+		if (start_index == last_index)
+			return iopt_pages_rw_page(pages, start_index,
+						  start_byte % PAGE_SIZE, data,
+						  length, flags);
+		return iopt_pages_rw_slow(pages, start_index, last_index,
+					  start_byte % PAGE_SIZE, data, length,
+					  flags);
+	}
+
+	/*
+	 * Try to copy using copy_to_user(). We do this as a fast path and
+	 * ignore any pinning decoherence, unlike a real DMA path.
+	 */
+	if (change_mm) {
+		if (!mmget_not_zero(pages->source_mm))
+			return iopt_pages_rw_slow(pages, start_index,
+						  last_index,
+						  start_byte % PAGE_SIZE, data,
+						  length, flags);
+		kthread_use_mm(pages->source_mm);
+	}
+
+	if (flags & IOMMUFD_ACCESS_RW_WRITE) {
+		if (copy_to_user(pages->uptr + start_byte, data, length))
+			rc = -EFAULT;
+	} else {
+		if (copy_from_user(data, pages->uptr + start_byte, length))
+			rc = -EFAULT;
+	}
+
+	if (change_mm) {
+		kthread_unuse_mm(pages->source_mm);
+		mmput(pages->source_mm);
+	}
+
+	return rc;
+}
+
+static struct iopt_pages_access *
+iopt_pages_get_exact_access(struct iopt_pages *pages, unsigned long index,
+			    unsigned long last)
+{
+	struct interval_tree_node *node;
+
+	lockdep_assert_held(&pages->mutex);
+
+	/* There can be overlapping ranges in this interval tree */
+	for (node = interval_tree_iter_first(&pages->access_itree, index, last);
+	     node; node = interval_tree_iter_next(node, index, last))
+		if (node->start == index && node->last == last)
+			return container_of(node, struct iopt_pages_access,
+					    node);
+	return NULL;
+}
+
+/**
+ * iopt_pages_add_access() - Record an in-knerel access for PFNs
+ * @pages: The source of PFNs
+ * @start_index: First page index
+ * @last_index: Inclusive last page index
+ * @out_pages: Output list of struct page's representing the PFNs
+ * @flags: IOMMUFD_ACCESS_RW_* flags
+ *
+ * Record that an in-kernel access will be accessing the pages, ensure they are
+ * pinned, and return the PFNs as a simple list of 'struct page *'.
+ *
+ * This should be undone through a matching call to iopt_pages_remove_access()
+ */
+int iopt_pages_add_access(struct iopt_pages *pages, unsigned long start_index,
+			  unsigned long last_index, struct page **out_pages,
+			  unsigned int flags)
+{
+	struct iopt_pages_access *access;
+	int rc;
+
+	if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
+		return -EPERM;
+
+	access = iopt_pages_get_exact_access(pages, start_index, last_index);
+	if (access) {
+		refcount_inc(&access->refcount);
+		iopt_pages_fill_from_xarray(pages, start_index, last_index,
+					    out_pages);
+		return 0;
+	}
+
+	access = kzalloc(sizeof(*access), GFP_KERNEL_ACCOUNT);
+	if (!access)
+		return -ENOMEM;
+
+	rc = iopt_pages_fill_xarray(pages, start_index, last_index, out_pages);
+	if (rc)
+		goto out_free;
+
+	access->node.start = start_index;
+	access->node.last = last_index;
+	refcount_set(&access->refcount, 1);
+	interval_tree_insert(&access->node, &pages->access_itree);
+	return 0;
+
+out_free:
+	kfree(access);
+	return rc;
+}
+
+/**
+ * iopt_pages_remove_access() - Release an in-kernel access for PFNs
+ * @area: The source of PFNs
+ * @start_index: First page index
+ * @last_index: Inclusive last page index
+ *
+ * Undo iopt_pages_add_access() and unpin the pages if necessary. The caller
+ * must stop using the PFNs before calling this.
+ */
+void iopt_pages_remove_access(struct iopt_area *area, unsigned long start_index,
+			      unsigned long last_index)
+{
+	struct iopt_pages_access *access;
+	struct iopt_pages *pages = area->pages;
+
+	mutex_lock(&pages->mutex);
+	access = iopt_pages_get_exact_access(pages, start_index, last_index);
+	if (WARN_ON(!access))
+		goto out_unlock;
+
+	WARN_ON(area->num_accesses == 0);
+	area->num_accesses--;
+
+	if (!refcount_dec_and_test(&access->refcount))
+		goto out_unlock;
+
+	interval_tree_remove(&access->node, &pages->access_itree);
+	iopt_pages_unfill_xarray(pages, start_index, last_index);
+	kfree(access);
+out_unlock:
+	mutex_unlock(&pages->mutex);
+}