Message ID | 1360608604-3520-1-git-send-email-imre.deak@intel.com |
---|---|
State | New |
Headers | show |
On Mon, 11 Feb 2013 20:50:04 +0200 Imre Deak <imre.deak@intel.com> wrote: > Add an iterator to walk through a scatter list a page at a time starting > at a specific page offset. As opposed to the mapping iterator this is What is "the mapping iterator"? > meant to be small, performing well even in simple loops like collecting > all pages on the scatterlist into an array or setting up an iommu table > based on the pages' DMA address. Where will this new macro be used? What is driving this effort? > v2: > - In each iteration sg_pgoffset pointed incorrectly at the next page not > the current one. > > Signed-off-by: Imre Deak <imre.deak@intel.com> > --- > include/linux/scatterlist.h | 50 +++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 50 insertions(+) > > diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h > index 4bd6c06..72578b5 100644 > --- a/include/linux/scatterlist.h > +++ b/include/linux/scatterlist.h > @@ -231,6 +231,56 @@ size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents, > */ > #define SG_MAX_SINGLE_ALLOC (PAGE_SIZE / sizeof(struct scatterlist)) > > +struct sg_page_iter { > + struct scatterlist *sg; > + int sg_pgoffset; > + struct page *page; > +}; Some documentation wouldn't hurt. What it's used for, why it exists. > +static inline int > +sg_page_cnt(struct scatterlist *sg) unneeded newline here. A more typical name would be "sg_page_count". Stripping words of their vowels makes the symbols harder to remember. > +{ > + BUG_ON(sg->offset || sg->length & ~PAGE_MASK); > + > + return sg->length >> PAGE_SHIFT; > +} > + > +static inline struct page * > +sg_page_iter_get_page(struct sg_page_iter *iter) > +{ > + while (iter->sg && iter->sg_pgoffset >= sg_page_cnt(iter->sg)) { > + iter->sg_pgoffset -= sg_page_cnt(iter->sg); > + iter->sg = sg_next(iter->sg); > + } > + > + return iter->sg ? nth_page(sg_page(iter->sg), iter->sg_pgoffset) : NULL; > +} > + > +static inline void > +sg_page_iter_next(struct sg_page_iter *iter) > +{ > + iter->sg_pgoffset++; > + iter->page = sg_page_iter_get_page(iter); > +} > + > +static inline void > +sg_page_iter_start(struct sg_page_iter *iter, struct scatterlist *sglist, > + unsigned long pgoffset) > +{ > + iter->sg = sglist; > + iter->sg_pgoffset = pgoffset; > + iter->page = sg_page_iter_get_page(iter); > +} All the above are undocumented also. I guess that's acceptable if they are only ever to be used by for_each_sg_page(). Although if that's the case then perhaps the identifiers should be a bit more obscure-looking. Usually we prefix them with "__" to say "this is in internal thing". > +/* > + * Simple sg page iterator, starting off at the given page offset. Each entry > + * on the sglist must start at offset 0 and can contain only full pages. > + * iter->page will point to the current page, iter->sg_pgoffset to the page > + * offset within the sg holding that page. > + */ > +#define for_each_sg_page(sglist, iter, pgoffset) \ > + for (sg_page_iter_start((iter), (sglist), (pgoffset)); \ > + (iter)->page; sg_page_iter_next(iter)) Because all the helper functions are inlined, this will expand to a quite large amount of code. And large code can be slow code due to I-cache eviction. I don't know *how* big this thing will be because the patch didn't include a caller and I can't be bothered writing my own. (And the lack of any caller means that the code will not be tested). So, exactly how big is this thing, and how do we know it's better this way than if we were to uninline some/all of the helpers?
On Mon, 2013-02-11 at 12:54 -0800, Andrew Morton wrote: > On Mon, 11 Feb 2013 20:50:04 +0200 > Imre Deak <imre.deak@intel.com> wrote: > > > Add an iterator to walk through a scatter list a page at a time starting > > at a specific page offset. As opposed to the mapping iterator this is > > What is "the mapping iterator"? It's the one implemented by sg_miter_{start,stop} in scatterlist.c. It also iterates through a scatterlist a page at a time, but it also kmaps these pages. Since in our use case we don't need to map the pages we needed a solution without this overhead. > > meant to be small, performing well even in simple loops like collecting > > all pages on the scatterlist into an array or setting up an iommu table > > based on the pages' DMA address. > > Where will this new macro be used? What is driving this effort? At the moment the only user of the macro would be the i915 driver, see [1] for the patches that takes it into use. In the patchset the macro was added as a DRM specific macro, but since it might be useful in the future for other drivers too (anything using dma-buf) I'd like to add it to a more generic place. > > v2: > > - In each iteration sg_pgoffset pointed incorrectly at the next page not > > the current one. > > > > Signed-off-by: Imre Deak <imre.deak@intel.com> > > --- > > include/linux/scatterlist.h | 50 +++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 50 insertions(+) > > > > diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h > > index 4bd6c06..72578b5 100644 > > --- a/include/linux/scatterlist.h > > +++ b/include/linux/scatterlist.h > > @@ -231,6 +231,56 @@ size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents, > > */ > > #define SG_MAX_SINGLE_ALLOC (PAGE_SIZE / sizeof(struct scatterlist)) > > > > +struct sg_page_iter { > > + struct scatterlist *sg; > > + int sg_pgoffset; > > + struct page *page; > > +}; > > Some documentation wouldn't hurt. What it's used for, why it exists. Ok, will add it. > > > +static inline int > > +sg_page_cnt(struct scatterlist *sg) > > unneeded newline here. > > A more typical name would be "sg_page_count". Stripping words of their > vowels makes the symbols harder to remember. Ok, will fix this. > > +{ > > + BUG_ON(sg->offset || sg->length & ~PAGE_MASK); > > + > > + return sg->length >> PAGE_SHIFT; > > +} > > + > > +static inline struct page * > > +sg_page_iter_get_page(struct sg_page_iter *iter) > > +{ > > + while (iter->sg && iter->sg_pgoffset >= sg_page_cnt(iter->sg)) { > > + iter->sg_pgoffset -= sg_page_cnt(iter->sg); > > + iter->sg = sg_next(iter->sg); > > + } > > + > > + return iter->sg ? nth_page(sg_page(iter->sg), iter->sg_pgoffset) : NULL; > > +} > > + > > +static inline void > > +sg_page_iter_next(struct sg_page_iter *iter) > > +{ > > + iter->sg_pgoffset++; > > + iter->page = sg_page_iter_get_page(iter); > > +} > > + > > +static inline void > > +sg_page_iter_start(struct sg_page_iter *iter, struct scatterlist *sglist, > > + unsigned long pgoffset) > > +{ > > + iter->sg = sglist; > > + iter->sg_pgoffset = pgoffset; > > + iter->page = sg_page_iter_get_page(iter); > > +} > > All the above are undocumented also. I guess that's acceptable if they > are only ever to be used by for_each_sg_page(). Although if that's the > case then perhaps the identifiers should be a bit more obscure-looking. > Usually we prefix them with "__" to say "this is in internal thing". Yes, they are meant to be used only internally, so I'll add the __ prefix. > > +/* > > + * Simple sg page iterator, starting off at the given page offset. Each entry > > + * on the sglist must start at offset 0 and can contain only full pages. > > + * iter->page will point to the current page, iter->sg_pgoffset to the page > > + * offset within the sg holding that page. > > + */ > > +#define for_each_sg_page(sglist, iter, pgoffset) \ > > + for (sg_page_iter_start((iter), (sglist), (pgoffset)); \ > > + (iter)->page; sg_page_iter_next(iter)) > > Because all the helper functions are inlined, this will expand to a > quite large amount of code. And large code can be slow code due to > I-cache eviction. > > I don't know *how* big this thing will be because the patch didn't > include a caller and I can't be bothered writing my own. (And the lack > of any caller means that the code will not be tested). > > So, exactly how big is this thing, and how do we know it's better this > way than if we were to uninline some/all of the helpers? I admit I only hoped compiler optimization would keep the inlined parts at a minimum, but now I actually checked (on Intel CPU). I applied the patchset from [1] and uninlined sg_page_iter_start as it's not significant for speed: size drivers/gpu/drm/i915/i915.ko 514855 15996 272 531123 81ab3 drivers/gpu/drm/i915/i915.ko Then uninlined all helpers: size drivers/gpu/drm/i915/i915.ko 513447 15996 272 529715 81533 drivers/gpu/drm/i915/i915.ko Since there are 8 invocations of the macro, the overhead for a single invocation is about (531123 - 529715) / 8 = 191 bytes. For speed, I benchmarked a simple loop which was basically: page = vmalloc(sizeof(*page) * 1000, GFP_KERNEL); for_each_sg_page(sglist, iter, 0) *page++ = iter.page; where each entry on the sglist contained 16 consecutive pages. This takes ~10% more time for the uninlined version to run. This is a rather artificial test and I couldn't come up with something more real-life using only the i915 driver's ioctl interface that would show a significant change in speed. So at least for now I'm ok with just uninlining all the helpers. Thanks for the review, Imre [1] http://lists.freedesktop.org/archives/intel-gfx/2013-February/024589.html
Hello, On Tue, Feb 12, 2013 at 07:07:20PM +0200, Imre Deak wrote: > It's the one implemented by sg_miter_{start,stop} in scatterlist.c. It > also iterates through a scatterlist a page at a time, but it also kmaps > these pages. Since in our use case we don't need to map the pages we > needed a solution without this overhead. I'm not against having non-mapping iterator but please consider that kmaps are no-ops on many configurations. It matters only for archs w/ high memory. > where each entry on the sglist contained 16 consecutive pages. This > takes ~10% more time for the uninlined version to run. This is a rather > artificial test and I couldn't come up with something more real-life > using only the i915 driver's ioctl interface that would show a > significant change in speed. > > So at least for now I'm ok with just uninlining all the helpers. Can we reimplement mapping iters using the new ones? Thanks.
On Tue, 12 Feb 2013 19:07:20 +0200 Imre Deak <imre.deak@intel.com> wrote: > > So, exactly how big is this thing, and how do we know it's better this > > way than if we were to uninline some/all of the helpers? > > I admit I only hoped compiler optimization would keep the inlined parts > at a minimum, but now I actually checked (on Intel CPU). I applied the > patchset from [1] and uninlined sg_page_iter_start as it's not > significant for speed: > > size drivers/gpu/drm/i915/i915.ko > 514855 15996 272 531123 81ab3 drivers/gpu/drm/i915/i915.ko > > Then uninlined all helpers: > size drivers/gpu/drm/i915/i915.ko > 513447 15996 272 529715 81533 drivers/gpu/drm/i915/i915.ko > > Since there are 8 invocations of the macro, the overhead for a single > invocation is about (531123 - 529715) / 8 = 191 bytes. > > For speed, I benchmarked a simple loop which was basically: > > page = vmalloc(sizeof(*page) * 1000, GFP_KERNEL); > for_each_sg_page(sglist, iter, 0) > *page++ = iter.page; > > where each entry on the sglist contained 16 consecutive pages. This > takes ~10% more time for the uninlined version to run. This is a rather > artificial test and I couldn't come up with something more real-life > using only the i915 driver's ioctl interface that would show a > significant change in speed. 10% for the function call overhead sounds reasonable. Of course, that test is biased in one direction. A test which was biased in the other direction would exercise all eight of the macro's callsites and would investigate the performance impact of a 1kbyte increase in L1 cache utilisation. And I must say, it would need to be a pretty damn carefully crafted test case to be able to trigger enough cache thrashing to cause a 10% hit. > So at least for now I'm ok with just uninlining all the helpers. OK.
On Tue, 2013-02-12 at 09:13 -0800, Tejun Heo wrote: > Hello, > > On Tue, Feb 12, 2013 at 07:07:20PM +0200, Imre Deak wrote: > > It's the one implemented by sg_miter_{start,stop} in scatterlist.c. It > > also iterates through a scatterlist a page at a time, but it also kmaps > > these pages. Since in our use case we don't need to map the pages we > > needed a solution without this overhead. > > I'm not against having non-mapping iterator but please consider that > kmaps are no-ops on many configurations. It matters only for archs w/ > high memory. Ok, I haven't thought about that. But in any case we care about those archs too and would like to avoid the mapping there as well. > > where each entry on the sglist contained 16 consecutive pages. This > > takes ~10% more time for the uninlined version to run. This is a rather > > artificial test and I couldn't come up with something more real-life > > using only the i915 driver's ioctl interface that would show a > > significant change in speed. > > > > So at least for now I'm ok with just uninlining all the helpers. > > Can we reimplement mapping iters using the new ones? Yes I think it's a good idea, I will follow up with a new patchset addressing this and Andrew's comments. Thanks, Imre
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 4bd6c06..72578b5 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -231,6 +231,56 @@ size_t sg_copy_to_buffer(struct scatterlist *sgl, unsigned int nents, */ #define SG_MAX_SINGLE_ALLOC (PAGE_SIZE / sizeof(struct scatterlist)) +struct sg_page_iter { + struct scatterlist *sg; + int sg_pgoffset; + struct page *page; +}; + +static inline int +sg_page_cnt(struct scatterlist *sg) +{ + BUG_ON(sg->offset || sg->length & ~PAGE_MASK); + + return sg->length >> PAGE_SHIFT; +} + +static inline struct page * +sg_page_iter_get_page(struct sg_page_iter *iter) +{ + while (iter->sg && iter->sg_pgoffset >= sg_page_cnt(iter->sg)) { + iter->sg_pgoffset -= sg_page_cnt(iter->sg); + iter->sg = sg_next(iter->sg); + } + + return iter->sg ? nth_page(sg_page(iter->sg), iter->sg_pgoffset) : NULL; +} + +static inline void +sg_page_iter_next(struct sg_page_iter *iter) +{ + iter->sg_pgoffset++; + iter->page = sg_page_iter_get_page(iter); +} + +static inline void +sg_page_iter_start(struct sg_page_iter *iter, struct scatterlist *sglist, + unsigned long pgoffset) +{ + iter->sg = sglist; + iter->sg_pgoffset = pgoffset; + iter->page = sg_page_iter_get_page(iter); +} + +/* + * Simple sg page iterator, starting off at the given page offset. Each entry + * on the sglist must start at offset 0 and can contain only full pages. + * iter->page will point to the current page, iter->sg_pgoffset to the page + * offset within the sg holding that page. + */ +#define for_each_sg_page(sglist, iter, pgoffset) \ + for (sg_page_iter_start((iter), (sglist), (pgoffset)); \ + (iter)->page; sg_page_iter_next(iter)) /* * Mapping sg iterator
Add an iterator to walk through a scatter list a page at a time starting at a specific page offset. As opposed to the mapping iterator this is meant to be small, performing well even in simple loops like collecting all pages on the scatterlist into an array or setting up an iommu table based on the pages' DMA address. v2: - In each iteration sg_pgoffset pointed incorrectly at the next page not the current one. Signed-off-by: Imre Deak <imre.deak@intel.com> --- include/linux/scatterlist.h | 50 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+)