Message ID: 20180621173635.21537-1-richard.henderson@linaro.org
Series: linux-user: Change mmap_lock to rwlock
On Thu, Jun 21, 2018 at 07:36:33 -1000, Richard Henderson wrote:
> The objective here is to avoid serializing the first-fault
> and no-fault aarch64 sve loads.
>
> When I first thought of this, we still had tb_lock protecting
> code generation and merely needed mmap_lock to prevent mapping
> changes while doing so.  Now (or very soon), tb_lock is gone
> and user-only code generation dual-purposes the mmap_lock for
> the code generation lock.
>
> There are two other legacy users of mmap_lock within linux-user,
> for ppc and mips.  I believe these are dead code from before these
> ports were converted to use tcg atomics.
>
> Thoughts?

I'm curious to see how much perf could be gained. It seems that the
hold times in SVE code for readers might not be very large, which
then wouldn't let us amortize the atomic inc of the read lock
(IOW, we might not see much of a difference compared to a regular
mutex).

Are you using any benchmark that shows any perf difference?

Thanks,

		E.
On 06/22/2018 02:12 PM, Emilio G. Cota wrote:
> I'm curious to see how much perf could be gained. It seems that the hold
> times in SVE code for readers might not be very large, which
> then wouldn't let us amortize the atomic inc of the read lock
> (IOW, we might not see much of a difference compared to a regular
> mutex).

In theory, the uncontended case for rwlocks is the same as a mutex.

> Are you using any benchmark that shows any perf difference?

Not so far.  Glibc has some microbenchmarks for strings, which I will
try next week, but they are not multi-threaded.  Maybe just run 4
threads of those benchmarks?

r~
On Sat, Jun 23, 2018 at 08:25:52 -0700, Richard Henderson wrote:
> On 06/22/2018 02:12 PM, Emilio G. Cota wrote:
> > I'm curious to see how much perf could be gained. It seems that the hold
> > times in SVE code for readers might not be very large, which
> > then wouldn't let us amortize the atomic inc of the read lock
> > (IOW, we might not see much of a difference compared to a regular
> > mutex).
>
> In theory, the uncontended case for rwlocks is the same as a mutex.

In the fast path, wr_lock/unlock have one more atomic operation than
mutex_lock/unlock. The perf difference is quite large in
microbenchmarks, e.g. changing tests/atomic_add-bench to use
pthread_mutex or pthread_rwlock_wrlock instead of an atomic operation
(this is enabled with the added -m flag):

  $ taskset -c 0 perf record tests/atomic_add-bench-mutex -d 4 -m
  Throughput: 62.05 Mops/s
  $ taskset -c 0 perf record tests/atomic_add-bench-rwlock -d 4 -m
  Throughput: 37.68 Mops/s

That said, it's unlikely that real user-space code (i.e. not
microbenchmarks) would be sensitive to the additional delay and/or
lower scalability. It is common to avoid frequent calls to mmap(2)
due to potential serialization in the kernel -- think for instance of
memory allocators: they do a few large mmap calls and then manage the
memory themselves.

To double-check, I ran some multi-threaded benchmarks from Hoard [1]
under qemu-linux-user, with and without the rwlock change, and
couldn't measure a significant difference.

[1] https://github.com/emeryberger/Hoard/tree/master/benchmarks

> > Are you using any benchmark that shows any perf difference?
>
> Not so far.  Glibc has some microbenchmarks for strings, which I will
> try next week, but they are not multi-threaded.  Maybe just run 4
> threads of those benchmarks?

I'd run more threads if possible. I have access to a 64-core machine,
so ping me once you identify benchmarks that are of interest.

		Emilio