[0/4] Address reader-writer concurrency in rte_hash

Message ID	1536253938-192391-1-git-send-email-honnappa.nagarahalli@arm.com
Headers	show Delivered-To: patch@linaro.org Received-SPF: pass (google.com: domain of dev-bounces@dpdk.org designates 92.243.14.124 as permitted sender) client-ip=92.243.14.124; From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> To: bruce.richardson@intel.com, pablo.de.lara.guarch@intel.com Cc: dev@dpdk.org, honnappa.nagarahalli@dpdk.org, gavin.hu@arm.com, steve.capper@arm.com, ola.liljedahl@arm.com, nd@arm.com, Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> Date: Thu, 6 Sep 2018 12:12:14 -0500 Message-Id: <1536253938-192391-1-git-send-email-honnappa.nagarahalli@arm.com> Subject: [dpdk-dev] [PATCH 0/4] Address reader-writer concurrency in rte_hash Precedence: list Errors-To: dev-bounces@dpdk.org Sender: "dev" <dev-bounces@dpdk.org>
Series	Address reader-writer concurrency in rte_hash \| expand [0/4] Address reader-writer concurrency in rte_hash [1/4] hash: correct key store element alignment [2/4] hash: add memory ordering to avoid race conditions [3/4] hash: fix rw concurrency while moving keys [4/4] hash: enable lock-free reader-writer concurrency

Honnappa Nagarahalli Sept. 6, 2018, 5:12 p.m. UTC

Currently, reader-writer concurrency problems in rte_hash are
    addressed using reader-writer locks. Use of reader-writer locks
    results in following issues:
    
    1) In many of the use cases for the hash table, writer threads
       are running on control plane. If the writer is preempted while
       holding the lock, it will block the readers for an extended period
       resulting in packet drops. This problem seems to apply for platforms
       with transactional memory support as well because of the algorithm
       used for rte_rwlock_write_lock_tm:
    
       static inline void
       rte_rwlock_write_lock_tm(rte_rwlock_t *rwl)
       {
            if (likely(rte_try_tm(&rwl->cnt)))
                    return;
            rte_rwlock_write_lock(rwl);
       }
    
       i.e. there is a posibility of using rte_rwlock_write_lock in
       failure cases.
    2) Reader-writer lock based solution does not address the following
       issue.
       rte_hash_lookup_xxx APIs return the index of the element in
       the key store. Application(reader) can use that index to reference
       other data structures in its scope. Because of this, the
       index should not be freed till the application completes
       using the index.
    3) Since writer blocks all the readers, the hash lookup
       rate comes down significantly when there is activity on the writer.
       This happens even for unrelated entries. Performance numbers
       given below clearly indicate this.
    
    Lock-free solution is required to solve these problems. This patch
    series adds the lock-free capabilities in the following steps:
    
    1) Correct the alignment for the key store entry to prep for
       memory ordering.
    2) Add memory ordering to prevent race conditions when a new key
       is added to the table.
    
    3) Reader-writer concurrency issue, caused by moving the keys
       to their alternate locations during key insert, is solved
       by introducing an atomic global counter indicating a change
       in table.
    
    4) This solution also has to solve the issue of readers using
       key store element even after the key is deleted from
       control plane.
       To solve this issue, the hash_del_key_xxx APIs do not free
       the key store element. The key store element has to be freed
       using the newly introduced rte_hash_free_key_with_position API.
       It needs to be called once all the readers have stopped using
       the key store element. How this is determined is outside
       the scope of this patch (RCU is one such mechanism that the
       application can use).
    
    4) Finally, a lock free reader-writer concurrency flag is added
       to enable this feature at run time.
    
    Performance numbers:
    Scenario: Equal number of writer/reader threads for concurrent
	      writers and readers. For readers only test, the
              entries are added upfront.

    Current code:
	Cores	Lookup     Lookup
		with add
	2	474	   246
	4	935        579
	6	1387       1048
	8	1766       1480
	10	2119       1951
	12	2546       2441

    With this patch:
	Cores	Lookup     Lookup
		with add
	2	291	   211
	4	297	   196
	6	304	   198
	8	309	   202
	10	315	   205
	12	319	   209

Honnappa Nagarahalli (4):
  hash: correct key store element alignment
  hash: add memory ordering to avoid race conditions
  hash: fix rw concurrency while moving keys
  hash: enable lock-free reader-writer concurrency

 lib/librte_hash/rte_cuckoo_hash.c    | 445 +++++++++++++++++++++++++----------
 lib/librte_hash/rte_cuckoo_hash.h    |   6 +-
 lib/librte_hash/rte_hash.h           |  63 ++++-
 lib/librte_hash/rte_hash_version.map |   7 +
 4 files changed, 393 insertions(+), 128 deletions(-)

-- 
2.7.4

Honnappa Nagarahalli Sept. 14, 2018, 9:18 p.m. UTC | #1

I have added the memory ordering ladder diagrams to the DPDK summit slides to help understand the changes. The slides are available at: https://dpdkuserspace2018.sched.com/event/G44w/lock-free-read-write-concurrency-in-rtehash. Please look at the backup slides.

Thank you,
Honnappa

-----Original Message-----
From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> 

Sent: Thursday, September 6, 2018 12:12 PM
To: bruce.richardson@intel.com; pablo.de.lara.guarch@intel.com
Cc: dev@dpdk.org; honnappa.nagarahalli; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl <Ola.Liljedahl@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
Subject: [PATCH 0/4] Address reader-writer concurrency in rte_hash

    Currently, reader-writer concurrency problems in rte_hash are
    addressed using reader-writer locks. Use of reader-writer locks
    results in following issues:
    
    1) In many of the use cases for the hash table, writer threads
       are running on control plane. If the writer is preempted while
       holding the lock, it will block the readers for an extended period
       resulting in packet drops. This problem seems to apply for platforms
       with transactional memory support as well because of the algorithm
       used for rte_rwlock_write_lock_tm:
    
       static inline void
       rte_rwlock_write_lock_tm(rte_rwlock_t *rwl)
       {
            if (likely(rte_try_tm(&rwl->cnt)))
                    return;
            rte_rwlock_write_lock(rwl);
       }
    
       i.e. there is a posibility of using rte_rwlock_write_lock in
       failure cases.
    2) Reader-writer lock based solution does not address the following
       issue.
       rte_hash_lookup_xxx APIs return the index of the element in
       the key store. Application(reader) can use that index to reference
       other data structures in its scope. Because of this, the
       index should not be freed till the application completes
       using the index.
    3) Since writer blocks all the readers, the hash lookup
       rate comes down significantly when there is activity on the writer.
       This happens even for unrelated entries. Performance numbers
       given below clearly indicate this.
    
    Lock-free solution is required to solve these problems. This patch
    series adds the lock-free capabilities in the following steps:
    
    1) Correct the alignment for the key store entry to prep for
       memory ordering.
    2) Add memory ordering to prevent race conditions when a new key
       is added to the table.
    
    3) Reader-writer concurrency issue, caused by moving the keys
       to their alternate locations during key insert, is solved
       by introducing an atomic global counter indicating a change
       in table.
    
    4) This solution also has to solve the issue of readers using
       key store element even after the key is deleted from
       control plane.
       To solve this issue, the hash_del_key_xxx APIs do not free
       the key store element. The key store element has to be freed
       using the newly introduced rte_hash_free_key_with_position API.
       It needs to be called once all the readers have stopped using
       the key store element. How this is determined is outside
       the scope of this patch (RCU is one such mechanism that the
       application can use).
    
    4) Finally, a lock free reader-writer concurrency flag is added
       to enable this feature at run time.
    
    Performance numbers:
    Scenario: Equal number of writer/reader threads for concurrent
	      writers and readers. For readers only test, the
              entries are added upfront.

    Current code:
	Cores	Lookup     Lookup
		with add
	2	474	   246
	4	935        579
	6	1387       1048
	8	1766       1480
	10	2119       1951
	12	2546       2441

    With this patch:
	Cores	Lookup     Lookup
		with add
	2	291	   211
	4	297	   196
	6	304	   198
	8	309	   202
	10	315	   205
	12	319	   209

Honnappa Nagarahalli (4):
  hash: correct key store element alignment
  hash: add memory ordering to avoid race conditions
  hash: fix rw concurrency while moving keys
  hash: enable lock-free reader-writer concurrency

 lib/librte_hash/rte_cuckoo_hash.c    | 445 +++++++++++++++++++++++++----------
 lib/librte_hash/rte_cuckoo_hash.h    |   6 +-
 lib/librte_hash/rte_hash.h           |  63 ++++-
 lib/librte_hash/rte_hash_version.map |   7 +
 4 files changed, 393 insertions(+), 128 deletions(-)

-- 
2.7.4

Honnappa Nagarahalli Sept. 26, 2018, 2:36 p.m. UTC | #2

Hi Bruce/Pablo,
    I need to get this into 18.11, appreciate any review/feedback soon.

Thank you,
Honnappa

> -----Original Message-----

> From: Honnappa Nagarahalli

> Sent: Friday, September 14, 2018 4:19 PM

> To: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>;

> bruce.richardson@intel.com; pablo.de.lara.guarch@intel.com

> Cc: dev@dpdk.org; honnappa.nagarahalli; Gavin Hu (Arm Technology China)

> <Gavin.Hu@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl

> <Ola.Liljedahl@arm.com>; nd <nd@arm.com>; yipeng1.wang@intel.com;

> Michel Machado <michel@digirati.com.br>; sameh.gobriel@intel.com

> Subject: RE: [PATCH 0/4] Address reader-writer concurrency in rte_hash

> 

> I have added the memory ordering ladder diagrams to the DPDK summit slides

> to help understand the changes. The slides are available at:

> https://dpdkuserspace2018.sched.com/event/G44w/lock-free-read-write-

> concurrency-in-rtehash. Please look at the backup slides.

> 

> Thank you,

> Honnappa

> 

> -----Original Message-----

> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

> Sent: Thursday, September 6, 2018 12:12 PM

> To: bruce.richardson@intel.com; pablo.de.lara.guarch@intel.com

> Cc: dev@dpdk.org; honnappa.nagarahalli; Gavin Hu (Arm Technology China)

> <Gavin.Hu@arm.com>; Steve Capper <Steve.Capper@arm.com>; Ola Liljedahl

> <Ola.Liljedahl@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli

> <Honnappa.Nagarahalli@arm.com>

> Subject: [PATCH 0/4] Address reader-writer concurrency in rte_hash

> 

>     Currently, reader-writer concurrency problems in rte_hash are

>     addressed using reader-writer locks. Use of reader-writer locks

>     results in following issues:

> 

>     1) In many of the use cases for the hash table, writer threads

>        are running on control plane. If the writer is preempted while

>        holding the lock, it will block the readers for an extended period

>        resulting in packet drops. This problem seems to apply for platforms

>        with transactional memory support as well because of the algorithm

>        used for rte_rwlock_write_lock_tm:

> 

>        static inline void

>        rte_rwlock_write_lock_tm(rte_rwlock_t *rwl)

>        {

>             if (likely(rte_try_tm(&rwl->cnt)))

>                     return;

>             rte_rwlock_write_lock(rwl);

>        }

> 

>        i.e. there is a posibility of using rte_rwlock_write_lock in

>        failure cases.

>     2) Reader-writer lock based solution does not address the following

>        issue.

>        rte_hash_lookup_xxx APIs return the index of the element in

>        the key store. Application(reader) can use that index to reference

>        other data structures in its scope. Because of this, the

>        index should not be freed till the application completes

>        using the index.

>     3) Since writer blocks all the readers, the hash lookup

>        rate comes down significantly when there is activity on the writer.

>        This happens even for unrelated entries. Performance numbers

>        given below clearly indicate this.

> 

>     Lock-free solution is required to solve these problems. This patch

>     series adds the lock-free capabilities in the following steps:

> 

>     1) Correct the alignment for the key store entry to prep for

>        memory ordering.

>     2) Add memory ordering to prevent race conditions when a new key

>        is added to the table.

> 

>     3) Reader-writer concurrency issue, caused by moving the keys

>        to their alternate locations during key insert, is solved

>        by introducing an atomic global counter indicating a change

>        in table.

> 

>     4) This solution also has to solve the issue of readers using

>        key store element even after the key is deleted from

>        control plane.

>        To solve this issue, the hash_del_key_xxx APIs do not free

>        the key store element. The key store element has to be freed

>        using the newly introduced rte_hash_free_key_with_position API.

>        It needs to be called once all the readers have stopped using

>        the key store element. How this is determined is outside

>        the scope of this patch (RCU is one such mechanism that the

>        application can use).

> 

>     4) Finally, a lock free reader-writer concurrency flag is added

>        to enable this feature at run time.

> 

>     Performance numbers:

>     Scenario: Equal number of writer/reader threads for concurrent

> 	      writers and readers. For readers only test, the

>               entries are added upfront.

> 

>     Current code:

> 	Cores	Lookup     Lookup

> 		with add

> 	2	474	   246

> 	4	935        579

> 	6	1387       1048

> 	8	1766       1480

> 	10	2119       1951

> 	12	2546       2441

> 

>     With this patch:

> 	Cores	Lookup     Lookup

> 		with add

> 	2	291	   211

> 	4	297	   196

> 	6	304	   198

> 	8	309	   202

> 	10	315	   205

> 	12	319	   209

> 

> Honnappa Nagarahalli (4):

>   hash: correct key store element alignment

>   hash: add memory ordering to avoid race conditions

>   hash: fix rw concurrency while moving keys

>   hash: enable lock-free reader-writer concurrency

> 

>  lib/librte_hash/rte_cuckoo_hash.c    | 445 +++++++++++++++++++++++++-------

> ---

>  lib/librte_hash/rte_cuckoo_hash.h    |   6 +-

>  lib/librte_hash/rte_hash.h           |  63 ++++-

>  lib/librte_hash/rte_hash_version.map |   7 +

>  4 files changed, 393 insertions(+), 128 deletions(-)

> 

> --

> 2.7.4

Wang, Yipeng1 Sept. 27, 2018, 11:45 p.m. UTC | #3

Hi Honnappa,

Reply inlined:

>-----Original Message-----

>

>    Currently, reader-writer concurrency problems in rte_hash are

>    addressed using reader-writer locks. Use of reader-writer locks

>    results in following issues:

>

>    1) In many of the use cases for the hash table, writer threads

>       are running on control plane. If the writer is preempted while

>       holding the lock, it will block the readers for an extended period

>       resulting in packet drops. This problem seems to apply for platforms

>       with transactional memory support as well because of the algorithm

>       used for rte_rwlock_write_lock_tm:

>

>       static inline void

>       rte_rwlock_write_lock_tm(rte_rwlock_t *rwl)

>       {

>            if (likely(rte_try_tm(&rwl->cnt)))

>                    return;

>            rte_rwlock_write_lock(rwl);

>       }

>

>       i.e. there is a posibility of using rte_rwlock_write_lock in

>       failure cases.

[Wang, Yipeng]  In our test, TSX failure happens very rarely on a TSX platform. But we agree
that without TSX, the current rte_rwlock implementation may make the writer to
hold a lock for a period of time.

>    2) Reader-writer lock based solution does not address the following

>       issue.

>       rte_hash_lookup_xxx APIs return the index of the element in

>       the key store. Application(reader) can use that index to reference

>       other data structures in its scope. Because of this, the

>       index should not be freed till the application completes

>       using the index.

[Wang, Yipeng]  I agree on this use case. But I think we should provide new API functions for deletion
to users who want this behavior,
without changing the meaning of current API if that is possible.

>    Current code:

>	Cores	Lookup     Lookup

>		with add

>	2	474	   246

>	4	935        579

>	6	1387       1048

>	8	1766       1480

>	10	2119       1951

>	12	2546       2441

>

>    With this patch:

>	Cores	Lookup     Lookup

>		with add

>	2	291	   211

>	4	297	   196

>	6	304	   198

>	8	309	   202

>	10	315	   205

>	12	319	   209

>

[Wang, Yipeng] It would be good if you could provide the platform information on these results.

Another comment is as you know we also have a new extension to rte_hash to enable extendable
buckets and partial-key hashing. Thanks for the comments btw. Could you check if your lockless
scheme also applies to those extensions?

Honnappa Nagarahalli Sept. 28, 2018, 9:11 p.m. UTC | #4

> 

> Hi Honnappa,

> 

> Reply inlined:

Hi Yipeng,
Thank you so much for reviewing.

> 

> >-----Original Message-----

> >

> >    Currently, reader-writer concurrency problems in rte_hash are

> >    addressed using reader-writer locks. Use of reader-writer locks

> >    results in following issues:

> >

> >    1) In many of the use cases for the hash table, writer threads

> >       are running on control plane. If the writer is preempted while

> >       holding the lock, it will block the readers for an extended period

> >       resulting in packet drops. This problem seems to apply for platforms

> >       with transactional memory support as well because of the algorithm

> >       used for rte_rwlock_write_lock_tm:

> >

> >       static inline void

> >       rte_rwlock_write_lock_tm(rte_rwlock_t *rwl)

> >       {

> >            if (likely(rte_try_tm(&rwl->cnt)))

> >                    return;

> >            rte_rwlock_write_lock(rwl);

> >       }

> >

> >       i.e. there is a posibility of using rte_rwlock_write_lock in

> >       failure cases.

> [Wang, Yipeng]  In our test, TSX failure happens very rarely on a TSX

> platform. But we agree that without TSX, the current rte_rwlock

> implementation may make the writer to hold a lock for a period of time.

> 

> >    2) Reader-writer lock based solution does not address the following

> >       issue.

> >       rte_hash_lookup_xxx APIs return the index of the element in

> >       the key store. Application(reader) can use that index to reference

> >       other data structures in its scope. Because of this, the

> >       index should not be freed till the application completes

> >       using the index.

> [Wang, Yipeng]  I agree on this use case. But I think we should provide new

> API functions for deletion to users who want this behavior, without

> changing the meaning of current API if that is possible.

In the lock-free algorithm, the rte_hash_delete API will not free the index. The new API rte_hash_free will free the index. The solution for the algorithm with rw locks needs to be thought about.

> 

> >    Current code:

> >	Cores	Lookup     Lookup

> >		with add

> >	2	474	   246

> >	4	935        579

> >	6	1387       1048

> >	8	1766       1480

> >	10	2119       1951

> >	12	2546       2441

> >

> >    With this patch:

> >	Cores	Lookup     Lookup

> >		with add

> >	2	291	   211

> >	4	297	   196

> >	6	304	   198

> >	8	309	   202

> >	10	315	   205

> >	12	319	   209

> >

> [Wang, Yipeng] It would be good if you could provide the platform

> information on these results.

Apologies, I should have done that. The machine I am using is: Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz, 64G memory. This is a hacked test case which is not upstreamed. In the case of 'Lookup with add' - I had equal number of threads calling 'rte_hash_add' and 'rte_hash_lookup'. In the case of 'Lookup' - a set of entries were added and all the threads called 'rte_hash_lookup'. Note that these are the numbers without htm. We have created another test case which I will upstream as next version of this patch. I will publish the numbers with that test case. So, you should be able to reproduce the numbers with that test case.

> 

> Another comment is as you know we also have a new extension to rte_hash

> to enable extendable buckets and partial-key hashing. Thanks for the

> comments btw. Could you check if your lockless scheme also applies to

> those extensions?

Thank you for reminding me on this. I thought I had covered everything. On a relook, I have missed few key issues. I will reply on the other email thread.

Wang, Yipeng1 Oct. 2, 2018, 12:30 a.m. UTC | #5

>>

>> Another comment is as you know we also have a new extension to rte_hash

>> to enable extendable buckets and partial-key hashing. Thanks for the

>> comments btw. Could you check if your lockless scheme also applies to

>> those extensions?

>Thank you for reminding me on this. I thought I had covered everything. On a relook, I have missed few key issues. I will reply on the

>other email thread.

[Wang, Yipeng] Hi, Honnappa, would like to rebase your V2 on the V4 patch I submitted for the ext table and partial-key hashing (http://patchwork.dpdk.org/cover/45620/). I will have V5 sent soon, so if it is in time, please rebase on V5. If you have any difficulty to do so we can work together on resolving the conflicts. Let's resolve the conflicts so that the final merge would be easier.

[0/4] Address reader-writer concurrency in rte_hash

Message

Comments