Message ID: 20180621173635.21537-1-richard.henderson@linaro.org
Series: linux-user: Change mmap_lock to rwlock
On Thu, Jun 21, 2018 at 07:36:33 -1000, Richard Henderson wrote:
> The objective here is to avoid serializing the first-fault
> and no-fault aarch64 sve loads.
>
> When I first thought of this, we still had tb_lock protecting
> code generation and merely needed mmap_lock to prevent mapping
> changes while doing so.  Now (or very soon), tb_lock is gone
> and user-only code generation dual-purposes the mmap_lock for
> the code generation lock.
>
> There are two other legacy users of mmap_lock within linux-user,
> for ppc and mips.  I believe these are dead code from before these
> ports were converted to use tcg atomics.
>
> Thoughts?

I'm curious to see how much perf could be gained. It seems that the
hold times in SVE code for readers might not be very large, which
then wouldn't let us amortize the atomic inc of the read lock
(IOW, we might not see much of a difference compared to a regular
mutex).

Are you using any benchmark that shows any perf difference?

Thanks,

		E.
On 06/22/2018 02:12 PM, Emilio G. Cota wrote:
> I'm curious to see how much perf could be gained. It seems that the hold
> times in SVE code for readers might not be very large, which
> then wouldn't let us amortize the atomic inc of the read lock
> (IOW, we might not see much of a difference compared to a regular
> mutex).

In theory, the uncontended case for rwlocks is the same as a mutex.

> Are you using any benchmark that shows any perf difference?

Not so far.  Glibc has some microbenchmarks for strings, which I will
try next week, but they are not multi-threaded.  Maybe just run 4
threads of those benchmarks?

r~
On Sat, Jun 23, 2018 at 08:25:52 -0700, Richard Henderson wrote:
> On 06/22/2018 02:12 PM, Emilio G. Cota wrote:
> > I'm curious to see how much perf could be gained. It seems that the hold
> > times in SVE code for readers might not be very large, which
> > then wouldn't let us amortize the atomic inc of the read lock
> > (IOW, we might not see much of a difference compared to a regular
> > mutex).
>
> In theory, the uncontended case for rwlocks is the same as a mutex.

In the fast path, wr_lock/unlock have one more atomic operation than
mutex_lock/unlock. The perf difference is quite large in
microbenchmarks, e.g. changing tests/atomic_add-bench to use
pthread_mutex or pthread_rwlock_wrlock instead of an atomic operation
(this is enabled with the added -m flag):

  $ taskset -c 0 perf record tests/atomic_add-bench-mutex -d 4 -m
  Throughput: 62.05 Mops/s
  $ taskset -c 0 perf record tests/atomic_add-bench-rwlock -d 4 -m
  Throughput: 37.68 Mops/s

That said, it's unlikely that real user-space code (i.e. not
microbenchmarks) would be sensitive to the additional delay and/or
lower scalability. It is common to avoid frequent calls to mmap(2)
due to potential serialization in the kernel -- think for instance of
memory allocators: they do a few large mmap calls and then manage the
memory themselves.

To double-check, I ran some multi-threaded benchmarks from Hoard [1]
under qemu-linux-user, with and without the rwlock change, and
couldn't measure a significant difference.

[1] https://github.com/emeryberger/Hoard/tree/master/benchmarks

> > Are you using any benchmark that shows any perf difference?
>
> Not so far.  Glibc has some microbenchmarks for strings, which I will
> try next week, but they are not multi-threaded.  Maybe just run 4
> threads of those benchmarks?

I'd run more threads if possible. I have access to a 64-core machine,
so ping me once you identify benchmarks that are of interest.

		Emilio