Message ID | 1469073189-9167-7-git-send-email-john.stultz@linaro.org |
---|---|
State | New |
On 07/21/2016 11:53 AM, John Stultz wrote:
> After lots of debugging on an occasional DMA ERR issue, I realized
> that the desc structures which we point the dma hardware at are being
> allocated out of regular memory. This means when we fill the desc
> structures, that data doesn't always get flushed out to memory by
> the time we start the dma transfer, resulting in the dma engine getting
> some null values, resulting in a DMA ERR on the first irq.

How about using a wmb() flush before starting the dma to sync the desc?

I remember I used dma_pool first, then did some optimization referring to
Russell's driver.

Thanks
On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org> wrote:
> On 07/21/2016 11:53 AM, John Stultz wrote:
>> After lots of debugging on an occasional DMA ERR issue, I realized
>> that the desc structures which we point the dma hardware at are being
>> allocated out of regular memory. This means when we fill the desc
>> structures, that data doesn't always get flushed out to memory by
>> the time we start the dma transfer, resulting in the dma engine getting
>> some null values, resulting in a DMA ERR on the first irq.
>
> How about using a wmb() flush before starting the dma to sync the desc?

So I'm not going to pretend to be an expert here, but my understanding
is that wmb() synchronizes cpu write ordering operations across cpus,
so the cpus see all the changes before the wmb() before they see any
changes after. But I'm not sure what effect wmb() has across cpu cache
to device ordering. I don't think it works as a cache flush to memory.

Andy's patch introducing the cyclic support actually had a wmb() in it
that I removed as I couldn't understand clearly why it was there (and
there wasn't a comment explaining, as required by checkpatch :). But
even with that wmb(), the DMA ERR was still seen.

Only with these two new changes have I gotten to the point where I
can't seem to trigger the DMA error.

thanks
-john
On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
> On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org> wrote:
>> On 07/21/2016 11:53 AM, John Stultz wrote:
>>> After lots of debugging on an occasional DMA ERR issue, I realized
>>> that the desc structures which we point the dma hardware at are being
>>> allocated out of regular memory. This means when we fill the desc
>>> structures, that data doesn't always get flushed out to memory by
>>> the time we start the dma transfer, resulting in the dma engine getting
>>> some null values, resulting in a DMA ERR on the first irq.
>>
>> How about using a wmb() flush before starting the dma to sync the desc?
>
> So I'm not going to pretend to be an expert here, but my understanding
> is that wmb() synchronizes cpu write ordering operations across cpus,

IIUI what the memory barrier does is tell the *compiler* to actually do
any writes that the code asked for, but which otherwise might actually be
deferred past that point. The compiler doesn't know that buffer area has
other hardware snooping it, so by default it feels it can play tricks
with what seems to it like just generally deferring spilling registers to
memory. wmb() makes sure the compiler's pending writes actually happen
right there. (writel() etc definitions have one built-in, so they always
do what you asked when you asked.)

That can be of interest when syncing with other cores but also to other
IPs that intend to use the written-to area immediately, which is what's
happening here. Without the barrier the write may not be issued anywhere,
even to local cpu cache, until after the hw is asked to go read the
buffer, corrupting what the hw sees.

> so the cpus see all the changes before the wmb() before they see any
> changes after. But I'm not sure what effect wmb() has across cpu cache
> to device ordering. I don't think it works as a cache flush to memory.
>
> Andy's patch introducing the cyclic support actually had a wmb() in it
> that I removed as I couldn't understand clearly why it was there (and
> there wasn't a comment explaining, as required by checkpatch :). But
> even with that wmb(), the DMA ERR was still seen.

The rule about the comment is there partially because there's a general
tendency for throwing voodoo barriers on broken things in case it helps.
But writing out memory descriptors that other hw is going to read is a
legit use for it, I think. If the compiler you use wasn't deferring the
write, you won't notice any difference removing it, but that doesn't mean
it shouldn't be there.

> Only with these two new changes have I gotten to the point where I
> can't seem to trigger the DMA error.

Sounds good...

-Andy

> thanks
> -john
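To make the hazard concrete, here is a minimal sketch of the pattern
under discussion. The descriptor layout, register offset, and addresses
are hypothetical, not the k3dma driver's actual code:

```c
#include <linux/io.h>
#include <linux/types.h>

#define DOORBELL_REG	0x20	/* hypothetical "go" register */

struct hw_desc {
	u32 src;
	u32 dst;
	u32 len;
	u32 config;
};

static void start_transfer(struct hw_desc *desc, void __iomem *base,
			   u32 desc_phys)
{
	desc->src    = 0x80000000;
	desc->dst    = 0x90000000;
	desc->len    = 4096;
	desc->config = 0x1;

	/*
	 * Without a barrier here, the stores above may still be pending
	 * (compiler-deferred or sitting in a store buffer) when the
	 * device begins fetching. wmb() orders them before the doorbell
	 * write below -- but note it does NOT flush a dirty cache line
	 * out to memory, which is why the thread ends up moving the
	 * descriptors into coherent (dma_pool) memory as well.
	 */
	wmb();

	/* Doorbell: tell the engine to go read the descriptor. */
	writel_relaxed(desc_phys, base + DOORBELL_REG);
}
```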
On Thu, Jul 21, 2016 at 02:27:02PM +0800, Andy Green wrote:
> On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
>> On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org>

Please fix your mail client to word wrap within paragraphs at something
substantially less than 80 columns. Doing this makes your messages much
easier to read and reply to.

>>> How about using a wmb() flush before starting the dma to sync the desc?

>> So I'm not going to pretend to be an expert here, but my understanding
>> is that wmb() synchronizes cpu write ordering operations across cpus,

> IIUI what the memory barrier does is tell the *compiler* to actually
> do any writes that the code asked for, but which otherwise might
> actually be deferred past that point. The compiler doesn't know that
> buffer area has other hardware snooping it, so by default it feels it
> can play tricks with what seems to it like just generally deferring
> spilling registers to memory. wmb() makes sure the compiler's pending
> writes actually happen right there. (writel() etc definitions have
> one built-in, so they always do what you asked when you asked.)

You might be interested in Mark Rutland's talk from ELC (Stale data, or
how we (mis-)manage modern caches):

http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf
https://www.youtube.com/watch?v=F0SlIMHRnLk
On Thu, 2016-07-21 at 11:40 +0100, Mark Brown wrote:
> On Thu, Jul 21, 2016 at 02:27:02PM +0800, Andy Green wrote:
>> On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
>>> On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org>
>
> Please fix your mail client to word wrap within paragraphs at something
> substantially less than 80 columns. Doing this makes your messages much
> easier to read and reply to.
>
>>>> How about using a wmb() flush before starting the dma to sync the desc?
>>> So I'm not going to pretend to be an expert here, but my understanding
>>> is that wmb() synchronizes cpu write ordering operations across cpus,
>> IIUI what the memory barrier does is tell the *compiler* to actually
>> do any writes that the code asked for, but which otherwise might
>> actually be deferred past that point. The compiler doesn't know that
>> buffer area has other hardware snooping it, so by default it feels it
>> can play tricks with what seems to it like just generally deferring
>> spilling registers to memory. wmb() makes sure the compiler's pending
>> writes actually happen right there. (writel() etc definitions have
>> one built-in, so they always do what you asked when you asked.)
>
> You might be interested in Mark Rutland's talk from ELC (Stale data, or
> how we (mis-)manage modern caches):
>
> http://events.linuxfoundation.org/sites/events/files/slides/slides_17.pdf

Thanks for the pointer. That is a somewhat different animal to wmb
though: wmb is about managing when the initiator of the write actualizes
it; prior to that the write does not instantaneously exist anywhere and
so is not subject to these coherency nightmares [1].

Also the presentation is from the hw POV only. At the Linux driver level
most of the considerations can be made moot if you just use the generic
Linux DMA or related apis, which thanks to the work of the Arm Linux
gods already know how to deal with / wrap the issues, plus or minus. So
it's not as dire as it sounds.

-Andy

[1] The details of some of those nightmares are unfortunately very
familiar to me, since I spent many months where Linux was being blamed
for Mali breakage via CCI on a platform that ultimately had its problems
resolved by tweaks in its secure world...

> https://www.youtube.com/watch?v=F0SlIMHRnLk
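For reference, a minimal sketch of the streaming-mapping half of the
generic DMA API Andy mentions; the helper name is hypothetical, and the
point is that the mapping call performs whatever cache maintenance the
platform needs, so the driver never flushes by hand:

```c
#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/errno.h>

static int publish_for_device(struct device *dev, void *buf, size_t len,
			      dma_addr_t *handle)
{
	/* Performs any cache clean needed before the device reads buf */
	*handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	if (dma_mapping_error(dev, *handle))
		return -ENOMEM;

	/* *handle may now safely be handed to the hardware */
	return 0;
}
```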
On 07/21/2016 01:22 PM, John Stultz wrote:
> On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org> wrote:
>> On 07/21/2016 11:53 AM, John Stultz wrote:
>>> After lots of debugging on an occasional DMA ERR issue, I realized
>>> that the desc structures which we point the dma hardware at are being
>>> allocated out of regular memory. This means when we fill the desc
>>> structures, that data doesn't always get flushed out to memory by
>>> the time we start the dma transfer, resulting in the dma engine getting
>>> some null values, resulting in a DMA ERR on the first irq.
>>
>> How about using a wmb() flush before starting the dma to sync the desc?
>
> So I'm not going to pretend to be an expert here, but my understanding
> is that wmb() synchronizes cpu write ordering operations across cpus,
> so the cpus see all the changes before the wmb() before they see any
> changes after. But I'm not sure what effect wmb() has across cpu cache
> to device ordering. I don't think it works as a cache flush to memory.
>
> Andy's patch introducing the cyclic support actually had a wmb() in it
> that I removed as I couldn't understand clearly why it was there (and
> there wasn't a comment explaining, as required by checkpatch :). But
> even with that wmb(), the DMA ERR was still seen.
>
> Only with these two new changes have I gotten to the point where I
> can't seem to trigger the DMA error.

Yes, you are right. Have double checked: we have to use non-cached
memory for the dma descriptors here, instead of cached memory from
kzalloc. And a barrier (wmb or writel) is used to ensure the descriptors
are written before starting the dma. Though we start the dma much later,
in issue_pending -> tasklet, so the chance is low.

Thanks
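For reference, a minimal sketch of the dma_pool pattern the patch below
adopts; the struct, pool name, and parameters here are illustrative
rather than the driver's actual values:

```c
#include <linux/device.h>
#include <linux/dmapool.h>
#include <linux/errno.h>

struct my_desc {
	u32 lli[4];		/* hardware-defined link-list item */
};

static struct dma_pool *pool;

static int init_pool(struct device *dev)
{
	/* Blocks come from DMA-coherent memory, 32-byte aligned */
	pool = dma_pool_create("my-lli-pool", dev,
			       sizeof(struct my_desc), 32, 0);
	return pool ? 0 : -ENOMEM;
}

static struct my_desc *get_desc(dma_addr_t *phys)
{
	/* *phys is the bus address the dma engine is pointed at */
	return dma_pool_alloc(pool, GFP_NOWAIT, phys);
}

static void put_desc(struct my_desc *d, dma_addr_t phys)
{
	dma_pool_free(pool, d, phys);
}
```

Because the pool hands back coherent memory, no CPU cache line can hold
a newer copy of the descriptor than the one the engine fetches, which is
exactly the failure mode the thread is chasing.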
On Wed, Jul 20, 2016 at 11:27 PM, Andy Green <andy@warmcat.com> wrote:
> On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
>> On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org> wrote:
>>> On 07/21/2016 11:53 AM, John Stultz wrote:
>>>> After lots of debugging on an occasional DMA ERR issue, I realized
>>>> that the desc structures which we point the dma hardware at are being
>>>> allocated out of regular memory. This means when we fill the desc
>>>> structures, that data doesn't always get flushed out to memory by
>>>> the time we start the dma transfer, resulting in the dma engine getting
>>>> some null values, resulting in a DMA ERR on the first irq.
>>>
>>> How about using a wmb() flush before starting the dma to sync the desc?
>>
>> So I'm not going to pretend to be an expert here, but my understanding
>> is that wmb() synchronizes cpu write ordering operations across cpus,
>
> IIUI what the memory barrier does is tell the *compiler* to actually do
> any writes that the code asked for, but which otherwise might actually
> be deferred past that point. The compiler doesn't know that buffer area
> has other hardware snooping it, so by default it feels it can play
> tricks with what seems to it like just generally deferring spilling
> registers to memory. wmb() makes sure the compiler's pending writes
> actually happen right there. (writel() etc definitions have one
> built-in, so they always do what you asked when you asked.)
>
> That can be of interest when syncing with other cores but also to other
> IPs that intend to use the written-to area immediately, which is what's
> happening here. Without the barrier the write may not be issued
> anywhere, even to local cpu cache, until after the hw is asked to go
> read the buffer, corrupting what the hw sees.
>
>> so the cpus see all the changes before the wmb() before they see any
>> changes after. But I'm not sure what effect wmb() has across cpu cache
>> to device ordering. I don't think it works as a cache flush to memory.
>>
>> Andy's patch introducing the cyclic support actually had a wmb() in it
>> that I removed as I couldn't understand clearly why it was there (and
>> there wasn't a comment explaining, as required by checkpatch :). But
>> even with that wmb(), the DMA ERR was still seen.
>
> The rule about the comment is there partially because there's a general
> tendency for throwing voodoo barriers on broken things in case it
> helps. But writing out memory descriptors that other hw is going to
> read is a legit use for it, I think. If the compiler you use wasn't
> deferring the write, you won't notice any difference removing it, but
> that doesn't mean it shouldn't be there.

Though from your comments above, wouldn't using writel() instead of
writel_relaxed() plus a wmb() be sufficient?

thanks
-john
On July 22, 2016 12:18:48 AM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
> On Wed, Jul 20, 2016 at 11:27 PM, Andy Green <andy@warmcat.com> wrote:
>> On July 21, 2016 1:22:02 PM GMT+08:00, John Stultz <john.stultz@linaro.org> wrote:
>>> On Wed, Jul 20, 2016 at 9:26 PM, zhangfei <zhangfei.gao@linaro.org> wrote:
>>>> On 07/21/2016 11:53 AM, John Stultz wrote:
>>>>> After lots of debugging on an occasional DMA ERR issue, I realized
>>>>> that the desc structures which we point the dma hardware at are being
>>>>> allocated out of regular memory. This means when we fill the desc
>>>>> structures, that data doesn't always get flushed out to memory by
>>>>> the time we start the dma transfer, resulting in the dma engine getting
>>>>> some null values, resulting in a DMA ERR on the first irq.
>>>>
>>>> How about using a wmb() flush before starting the dma to sync the desc?
>>>
>>> So I'm not going to pretend to be an expert here, but my understanding
>>> is that wmb() synchronizes cpu write ordering operations across cpus,
>>
>> IIUI what the memory barrier does is tell the *compiler* to actually do
>> any writes that the code asked for, but which otherwise might actually
>> be deferred past that point. The compiler doesn't know that buffer area
>> has other hardware snooping it, so by default it feels it can play
>> tricks with what seems to it like just generally deferring spilling
>> registers to memory. wmb() makes sure the compiler's pending writes
>> actually happen right there. (writel() etc definitions have one
>> built-in, so they always do what you asked when you asked.)
>>
>> That can be of interest when syncing with other cores but also to other
>> IPs that intend to use the written-to area immediately, which is what's
>> happening here. Without the barrier the write may not be issued
>> anywhere, even to local cpu cache, until after the hw is asked to go
>> read the buffer, corrupting what the hw sees.
>>
>>> so the cpus see all the changes before the wmb() before they see any
>>> changes after. But I'm not sure what effect wmb() has across cpu cache
>>> to device ordering. I don't think it works as a cache flush to memory.
>>>
>>> Andy's patch introducing the cyclic support actually had a wmb() in it
>>> that I removed as I couldn't understand clearly why it was there (and
>>> there wasn't a comment explaining, as required by checkpatch :). But
>>> even with that wmb(), the DMA ERR was still seen.
>>
>> The rule about the comment is there partially because there's a general
>> tendency for throwing voodoo barriers on broken things in case it
>> helps. But writing out memory descriptors that other hw is going to
>> read is a legit use for it, I think. If the compiler you use wasn't
>> deferring the write, you won't notice any difference removing it, but
>> that doesn't mean it shouldn't be there.
>
> Though from your comments above, wouldn't using writel() instead of
> writel_relaxed() plus a wmb() be sufficient?

Yes, since on Arm writel() sticks a wmb (ultimately a dsb instruction +
outer_sync()) before every write it does.

-Andy

> thanks
> -john
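A short sketch of the two forms being discussed as equivalent; the
register name, offset, and value are hypothetical, not k3dma's actual
register layout:

```c
#include <linux/io.h>

#define CH_START	0x24	/* hypothetical channel-start register */

static void kick_channel_explicit(void __iomem *base, u32 val)
{
	/* Form 1: explicit barrier, then a relaxed MMIO write */
	wmb();			/* publish prior descriptor stores */
	writel_relaxed(val, base + CH_START);
}

static void kick_channel_writel(void __iomem *base, u32 val)
{
	/*
	 * Form 2: writel() issues the barrier itself before the store
	 * (on Arm, a dsb + outer_sync()), so no separate wmb() is
	 * needed.
	 */
	writel(val, base + CH_START);
}
```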
diff --git a/drivers/dma/k3dma.c b/drivers/dma/k3dma.c
index 950ed36..259bd3b 100644
--- a/drivers/dma/k3dma.c
+++ b/drivers/dma/k3dma.c
@@ -8,6 +8,8 @@
  */
 #include <linux/sched.h>
 #include <linux/device.h>
+#include <linux/dma-mapping.h>
+#include <linux/dmapool.h>
 #include <linux/dmaengine.h>
 #include <linux/init.h>
 #include <linux/interrupt.h>
@@ -26,6 +28,7 @@
 #define DRIVER_NAME		"k3-dma"
 #define DMA_MAX_SIZE		0x1ffc
 #define DMA_CYCLIC_MAX_PERIOD	0x1000
+#define LLI_BLOCK_SIZE		(4 * PAGE_SIZE)
 
 #define INT_STAT		0x00
 #define INT_TC1			0x04
@@ -74,7 +77,7 @@ struct k3_dma_desc_sw {
 	dma_addr_t		desc_hw_lli;
 	size_t			desc_num;
 	size_t			size;
-	struct k3_desc_hw	desc_hw[0];
+	struct k3_desc_hw	*desc_hw;
 };
 
 struct k3_dma_phy;
@@ -107,6 +110,7 @@ struct k3_dma_dev {
 	struct k3_dma_phy	*phy;
 	struct k3_dma_chan	*chans;
 	struct clk		*clk;
+	struct dma_pool		*pool;
 	u32			dma_channels;
 	u32			dma_requests;
 };
@@ -434,6 +438,35 @@ static void k3_dma_fill_desc(struct k3_dma_desc_sw *ds, dma_addr_t dst,
 	ds->desc_hw[num].config = ccfg;
 }
 
+static struct k3_dma_desc_sw *k3_dma_alloc_desc_resource(int num,
+					struct dma_chan *chan)
+{
+	struct k3_dma_chan *c = to_k3_chan(chan);
+	struct k3_dma_desc_sw *ds;
+	struct k3_dma_dev *d = to_k3_dma(chan->device);
+	int lli_limit = LLI_BLOCK_SIZE / sizeof(struct k3_desc_hw);
+
+	if (num > lli_limit) {
+		dev_dbg(chan->device->dev, "vch %p: sg num %d exceed max %d\n",
+			&c->vc, num, lli_limit);
+		return NULL;
+	}
+
+	ds = kzalloc(sizeof(*ds), GFP_ATOMIC);
+	if (!ds)
+		return NULL;
+
+	ds->desc_hw = dma_pool_alloc(d->pool, GFP_NOWAIT, &ds->desc_hw_lli);
+	if (!ds->desc_hw) {
+		dev_dbg(chan->device->dev, "vch %p: dma alloc fail\n", &c->vc);
+		kfree(ds);
+		return NULL;
+	}
+	memset(ds->desc_hw, 0, sizeof(struct k3_desc_hw) * num);
+	ds->desc_num = num;
+	return ds;
+}
+
 static struct dma_async_tx_descriptor *k3_dma_prep_memcpy(
 	struct dma_chan *chan,	dma_addr_t dst, dma_addr_t src,
 	size_t len, unsigned long flags)
@@ -447,15 +480,14 @@ static struct dma_async_tx_descriptor *k3_dma_prep_memcpy(
 		return NULL;
 
 	num = DIV_ROUND_UP(len, DMA_MAX_SIZE);
-	ds = kzalloc(sizeof(*ds) + num * sizeof(ds->desc_hw[0]), GFP_ATOMIC);
+
+	ds = k3_dma_alloc_desc_resource(num, chan);
 	if (!ds) {
 		dev_dbg(chan->device->dev, "vchan %p: kzalloc fail\n", &c->vc);
 		return NULL;
 	}
 	c->cyclic = 0;
-	ds->desc_hw_lli = __virt_to_phys((unsigned long)&ds->desc_hw[0]);
 	ds->size = len;
-	ds->desc_num = num;
 	num = 0;
 
 	if (!c->ccfg) {
@@ -506,12 +538,9 @@ static struct dma_async_tx_descriptor *k3_dma_prep_slave_sg(
 			num += DIV_ROUND_UP(avail, DMA_MAX_SIZE) - 1;
 	}
 
-	ds = kzalloc(sizeof(*ds) + num * sizeof(ds->desc_hw[0]), GFP_ATOMIC);
+	ds = k3_dma_alloc_desc_resource(num, chan);
 	if (!ds)
 		return NULL;
-
-	ds->desc_hw_lli = __virt_to_phys((unsigned long)&ds->desc_hw[0]);
-	ds->desc_num = num;
 	num = 0;
 
 	for_each_sg(sgl, sg, sglen, i) {
@@ -564,13 +593,10 @@ k3_dma_prep_dma_cyclic(struct dma_chan *chan, dma_addr_t buf_addr,
 	if (avail > modulo)
 		num += DIV_ROUND_UP(avail, modulo) - 1;
 
-	ds = kzalloc(sizeof(*ds) + num * sizeof(ds->desc_hw[0]), GFP_ATOMIC);
+	ds = k3_dma_alloc_desc_resource(num, chan);
 	if (!ds)
 		return NULL;
 
-	ds->desc_hw_lli = __virt_to_phys((unsigned long)&ds->desc_hw[0]);
-	ds->desc_num = num;
-
 	c->cyclic = 1;
 	addr = buf_addr;
 	avail = buf_len;
@@ -664,7 +690,9 @@ static void k3_dma_free_desc(struct virt_dma_desc *vd)
 {
 	struct k3_dma_desc_sw *ds =
 		container_of(vd, struct k3_dma_desc_sw, vd);
+	struct k3_dma_dev *d = to_k3_dma(vd->tx.chan->device);
 
+	dma_pool_free(d->pool, ds->desc_hw, ds->desc_hw_lli);
 	kfree(ds);
 }
 
@@ -810,6 +838,12 @@ static int k3_dma_probe(struct platform_device *op)
 	if (ret)
 		return ret;
 
+	/* A DMA memory pool for LLIs, align on 32-byte boundary */
+	d->pool = dmam_pool_create(DRIVER_NAME, &op->dev,
+					LLI_BLOCK_SIZE, 32, 0);
+	if (!d->pool)
+		return -ENOMEM;
+
 	/* init phy channel */
 	d->phy = devm_kzalloc(&op->dev,
 		d->dma_channels * sizeof(struct k3_dma_phy), GFP_KERNEL);
After lots of debugging on an occasional DMA ERR issue, I realized
that the desc structures which we point the dma hardware at are being
allocated out of regular memory. This means when we fill the desc
structures, that data doesn't always get flushed out to memory by
the time we start the dma transfer, resulting in the dma engine getting
some null values, resulting in a DMA ERR on the first irq.

Thus, this patch adopts a mechanism similar to the zx296702_dma driver's
of allocating the desc structures from a dma pool, so the memory caching
rules are properly set to avoid this issue.

Cc: Zhangfei Gao <zhangfei.gao@linaro.org>
Cc: Jingoo Han <jg1.han@samsung.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: Maxime Ripard <maxime.ripard@free-electrons.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Andy Green <andy@warmcat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 drivers/dma/k3dma.c | 58 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 46 insertions(+), 12 deletions(-)

-- 
1.9.1