Message ID | 1395186958-5594-1-git-send-email-bill.fischofer@linaro.org
State      | Superseded, archived
Hi,

It seems to work, except that those memory syncs are still needed. See comments below.

On Wednesday, 19 March 2014 01:55:58 UTC+2, Bill Fischofer wrote:
> Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
> ---
>  include/odp_barrier.h                       |  3 +-
>  platform/linux-generic/source/odp_barrier.c | 46 +++++++++++++----------------
>  2 files changed, 21 insertions(+), 28 deletions(-)
>
> diff --git a/include/odp_barrier.h b/include/odp_barrier.h
> index bb4a6c5..0a1404b 100644
> --- a/include/odp_barrier.h
> +++ b/include/odp_barrier.h
> @@ -28,8 +28,7 @@ extern "C" {
>   */
>  typedef struct odp_barrier_t {
>          int              count;
> -        odp_atomic_int_t in;
> -        odp_atomic_int_t out;
> +        odp_atomic_int_t bar;
>  } odp_barrier_t;
>
> diff --git a/platform/linux-generic/source/odp_barrier.c b/platform/linux-generic/source/odp_barrier.c
> index 64fbdb9..9dc6fb5 100644
> --- a/platform/linux-generic/source/odp_barrier.c
> +++ b/platform/linux-generic/source/odp_barrier.c
> @@ -11,40 +11,34 @@
>  void odp_barrier_init_count(odp_barrier_t *barrier, int count)
>  {
>          barrier->count = count;
> -        barrier->in    = 0;
> -        barrier->out   = count - 1;
> -        odp_sync_stores();

A write sync is needed here to avoid a race between init and the first barrier_sync call.

> +        barrier->bar = 0;
>  }
>
> +/*
> + * Efficient barrier_sync -
> + *
> + * Barriers are initialized with a count of the number of callers
> + * that must sync on the barrier before any may proceed.
> + *
> + * To avoid race conditions and to permit the barrier to be fully
> + * reusable, the barrier value cycles between 0..2*count-1. When
> + * synchronizing the wasless variable simply tracks which half of
> + * the cycle the barrier was in upon entry. Exit is when the
> + * barrier crosses to the other half of the cycle.
> + */
>
>  void odp_barrier_sync(odp_barrier_t *barrier)
>  {
>          int count;
> +        int wasless;
>
> -        odp_sync_stores();

A write sync is needed here to ensure data ordering over the barrier (what happened before the barrier vs. what happens after it).

> -
> -        count = odp_atomic_fetch_inc_int(&barrier->in);
> -
> -        if (count == barrier->count - 1) {
> -                /* If last thread, release others */
> -                barrier->in = 0;
> -                odp_sync_stores();
> -
> -                /* Wait for others to exit */
> -                while (barrier->out)
> -                        odp_spin();
> -
> -                /* Ready, reset out counter */
> -                barrier->out = barrier->count - 1;
> -                odp_sync_stores();
> +        wasless = barrier->bar < barrier->count;
> +        count   = odp_atomic_fetch_inc_int(&barrier->bar);
>
> +        if (count == 2*barrier->count-1) {
> +                barrier->bar = 0;
>          } else {
> -                /* Wait for the last thread*/
> -                while (barrier->in)
> -                        odp_spin();
> -
> -                /* Ready */
> -                odp_atomic_dec_int(&barrier->out);
> -                odp_mem_barrier();
> +                while ((barrier->bar < barrier->count) == wasless)
> +                        odp_spin();
>          }
>  }

odp_mem_barrier();

A memory barrier is needed at the end to prevent the optimizer from moving other code inside the barrier.

-Petri
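For clarity, a minimal sketch (not the committed code) of where the two syncs Petri requests would land in the new algorithm, assuming odp_sync_stores() orders prior stores and odp_mem_barrier() acts as at least a compiler barrier:

        void odp_barrier_sync(odp_barrier_t *barrier)
        {
                int count;
                int wasless;

                odp_sync_stores();      /* flush pre-barrier stores before entering */

                wasless = barrier->bar < barrier->count;
                count   = odp_atomic_fetch_inc_int(&barrier->bar);

                if (count == 2 * barrier->count - 1) {
                        barrier->bar = 0;       /* last caller wraps the cycle */
                } else {
                        while ((barrier->bar < barrier->count) == wasless)
                                odp_spin();
                }

                odp_mem_barrier();      /* keep post-barrier code from moving inside */
        }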
The odp_barrier_t definition declares the barrier field as type odp_atomic_int_t, which is volatile. Volatile already requires that the compiler preserve these orderings, so the additional syncs are unnecessary.

The various memory sync instructions are designed to be used when interacting with areas of memory referenced by HW, not by other CPUs. A typical example where such instructions are needed is a HW interface exposed as two MMRs, where a write to one location requests an operation whose result is read from another location. A memory sync is needed between the write and the read to ensure that the write is propagated from the CPU before the read is presented on the bus.

Inter-CPU race conditions are precisely what the fetch-and-op instructions are for; that is intrinsic to the definition of a barrier. Volatile protects against intra-CPU issues.

The only potential race between barrier_init() and barrier_sync() would be if the application fires off threads that start doing barrier_sync() before barrier_init() is called. In this case a memory sync in barrier_init() offers no protection, since the ordering of operations performed by another CPU cannot be determined by the action of one CPU. That's an application design issue: barriers must be initialized before they are used, e.g., using critical sections. An application would normally initialize a barrier before it begins the parallel operations that reference it. Again, a correct implementation of volatile is sufficient to provide protection here.

Do we have specific conditions in mind that seem to require the extra syncs, or is this a "just in case" sort of protection?

Bill

On Wed, Mar 19, 2014 at 2:47 AM, Petri Savolainen <petri.savolainen@linaro.org> wrote:
> Hi,
>
> It seems to work, except that those memory syncs are still needed. See
> comments below.
>
> [quoted patch and review comments trimmed]
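As a concrete illustration of the MMR case Bill describes above, a hedged sketch (the register layout and base address are invented for illustration; odp_sync_stores() stands in for the platform's store sync):

        #include <stdint.h>

        /* Hypothetical command/result register pair; address is illustrative. */
        static volatile uint32_t *const mmr = (volatile uint32_t *)0xf0000000;
        #define CMD_REG    0    /* write: request an operation */
        #define RESULT_REG 1    /* read:  fetch the result     */

        static uint32_t hw_request(uint32_t cmd)
        {
                mmr[CMD_REG] = cmd;     /* request the operation */
                odp_sync_stores();      /* push the write out to the bus... */
                return mmr[RESULT_REG]; /* ...before the result is read */
        }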
Hi,

Believe me, those syncs are needed. Volatiles don't have anything to do with those syncs.

odp_sync_stores() is used to force data store ordering within a CPU; otherwise you cannot be sure when those stores actually hit the memory (out-of-order execution + write buffering). E.g. without the store sync, the init function's stores may hit memory _after_ other threads have been created/launched and have already executed their first barrier syncs...

The same goes for the optimizer: C lines after the barrier sync may crawl inside the barrier sync code. odp_mem_barrier() prevents that from happening.

-Petri

On Wednesday, 19 March 2014 13:20:37 UTC+2, Bill Fischofer wrote:
> [quoted message trimmed]
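An illustrative sketch of the launch-ordering scenario Petri describes (the names and the release-flag pattern are invented for illustration; the odp_* calls are the linux-generic ones under discussion):

        /* Workers are already running and spin on a release flag. */
        #define NUM_WORKERS 4

        static odp_barrier_t barrier;
        static volatile int go;

        static void worker(void)
        {
                while (!go)
                        odp_spin();

                /* If init's stores were still sitting in the controller
                 * CPU's write buffer when 'go' became visible, this call
                 * may observe a half-initialized barrier. */
                odp_barrier_sync(&barrier);
        }

        static void controller(void)
        {
                odp_barrier_init_count(&barrier, NUM_WORKERS);
                /* Petri's point: an odp_sync_stores() belongs here (or at
                 * the end of init) so the stores above become visible
                 * before the flag does. */
                go = 1;
        }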
At least on PPC, memory sync (sync, msync) instructions affect how other CPUs perceive the order of memory operations. This is a quote from the E-cores programming manual:

"If the programmer must ensure that cache or other instructions have been performed with respect to all other processors and system mechanisms, an msync must be placed after those instructions."

Fetch-and-op primitives on PPC are translated to this kind of sequence:

        loop:
                lwarx  r5,0,r3   # load and reserve
                add    r0,r4,r5  # increment word
                stwcx. r0,0,r3   # store new value if still reserved
                bc     4,2,loop  # loop if lost reservation

The order of lwarx and stwcx. is kept by default, so there is no need for memory barriers here.

Hope this helps,
Alex

On 19 March 2014 13:20, Bill Fischofer <bill.fischofer@linaro.org> wrote:
> [quoted message trimmed]
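For comparison, a sketch of a fetch-and-inc built on a GCC __sync builtin, which on PPC expands to essentially the loop Alex quotes but with full-barrier semantics added around it (a sketch only; ODP's actual odp_atomic_fetch_inc_int may be implemented differently):

        /* Unlike a bare lwarx/stwcx. loop, the __sync builtins are
         * documented as full memory barriers, so ordering does not rely
         * on the reservation pair alone. */
        static inline int fetch_inc_full_barrier(volatile int *p)
        {
                return __sync_fetch_and_add(p, 1);
        }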
I'll buy the odp_sync_stores() in the init routine, but given that the new barrier_sync() only references the barrier via fetch-and-op calls, it should be safe as-is. If the barrier reset (back to 0) is delayed for any reason, that doesn't matter, since the code doesn't care when that store is actually performed.

Bill

On Wed, Mar 19, 2014 at 6:57 AM, Petri Savolainen <petri.savolainen@linaro.org> wrote:
> [quoted message trimmed]
Hi,

OK, third time. What is the purpose of odp_barrier? It synchronizes/separates the execution of the code lines before the barrier from the code lines following it.

If the barrier does not include store synchronization, stores issued before the barrier may complete at any random time after the barrier. E.g. if the barrier is used to ensure that all cores have completed operation xyz (called before the barrier), a barrier without a store sync would not guarantee that. It's not about the correctness of the barrier algorithm, it's about synchronization of the user data (which the barrier should protect).

The same goes for the mem barrier at the end. Without it, operation zyx (after the barrier) may start executing inside the barrier - and again user data synchronization would be lost.

-Petri

On Wednesday, 19 March 2014 17:53:31 UTC+2, Bill Fischofer wrote:
> [quoted message trimmed]
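To make Petri's "operation xyz" example concrete, a sketch of the user-data pattern at stake (compute_xyz() and use_all_results() are hypothetical helpers):

        #define NUM_WORKERS 4

        static int results[NUM_WORKERS];
        static odp_barrier_t barrier;   /* initialized with count NUM_WORKERS */

        static void worker(int id)
        {
                results[id] = compute_xyz(id);  /* "xyz", before the barrier */

                /* Without a store sync inside barrier_sync(), results[id]
                 * may become visible only after other workers have already
                 * passed the barrier and read the table. */
                odp_barrier_sync(&barrier);

                /* Valid only if the barrier ordered the stores above:
                 * every slot is now visible to every worker. */
                use_all_results(results, NUM_WORKERS);
        }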
Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
---
 include/odp_barrier.h                       |  3 +-
 platform/linux-generic/source/odp_barrier.c | 46 +++++++++++++----------------
 2 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/include/odp_barrier.h b/include/odp_barrier.h
index bb4a6c5..0a1404b 100644
--- a/include/odp_barrier.h
+++ b/include/odp_barrier.h
@@ -28,8 +28,7 @@ extern "C" {
  */
 typedef struct odp_barrier_t {
         int              count;
-        odp_atomic_int_t in;
-        odp_atomic_int_t out;
+        odp_atomic_int_t bar;
 } odp_barrier_t;


diff --git a/platform/linux-generic/source/odp_barrier.c b/platform/linux-generic/source/odp_barrier.c
index 64fbdb9..9dc6fb5 100644
--- a/platform/linux-generic/source/odp_barrier.c
+++ b/platform/linux-generic/source/odp_barrier.c
@@ -11,40 +11,34 @@
 void odp_barrier_init_count(odp_barrier_t *barrier, int count)
 {
         barrier->count = count;
-        barrier->in    = 0;
-        barrier->out   = count - 1;
-        odp_sync_stores();
+        barrier->bar = 0;
 }

+/*
+ * Efficient barrier_sync -
+ *
+ * Barriers are initialized with a count of the number of callers
+ * that must sync on the barrier before any may proceed.
+ *
+ * To avoid race conditions and to permit the barrier to be fully
+ * reusable, the barrier value cycles between 0..2*count-1. When
+ * synchronizing the wasless variable simply tracks which half of
+ * the cycle the barrier was in upon entry. Exit is when the
+ * barrier crosses to the other half of the cycle.
+ */

 void odp_barrier_sync(odp_barrier_t *barrier)
 {
         int count;
+        int wasless;

-        odp_sync_stores();
-
-        count = odp_atomic_fetch_inc_int(&barrier->in);
-
-        if (count == barrier->count - 1) {
-                /* If last thread, release others */
-                barrier->in = 0;
-                odp_sync_stores();
-
-                /* Wait for others to exit */
-                while (barrier->out)
-                        odp_spin();
-
-                /* Ready, reset out counter */
-                barrier->out = barrier->count - 1;
-                odp_sync_stores();
+        wasless = barrier->bar < barrier->count;
+        count   = odp_atomic_fetch_inc_int(&barrier->bar);

+        if (count == 2*barrier->count-1) {
+                barrier->bar = 0;
         } else {
-                /* Wait for the last thread*/
-                while (barrier->in)
-                        odp_spin();
-
-                /* Ready */
-                odp_atomic_dec_int(&barrier->out);
-                odp_mem_barrier();
+                while ((barrier->bar < barrier->count) == wasless)
+                        odp_spin();
         }
 }
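For context, a minimal usage sketch of this API (worker launch elided; do_phase0_work() and do_phase1_work() are placeholders):

        #define NUM_THREADS 4

        static odp_barrier_t barrier;

        /* Each worker runs this; no thread enters phase 1 until all
         * NUM_THREADS have finished phase 0. */
        static void worker(void)
        {
                do_phase0_work();            /* hypothetical per-thread work */
                odp_barrier_sync(&barrier);  /* rendezvous point */
                do_phase1_work();            /* hypothetical follow-on work */
        }

        int main(void)
        {
                odp_barrier_init_count(&barrier, NUM_THREADS);
                /* ... launch NUM_THREADS threads running worker() ... */
                return 0;
        }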