| Message ID | 20230819010218.192706-1-richard.henderson@linaro.org |
|------------|------------------------------------------------------|
| Series | crypto: Provide clmul.h and host accel |
On Sat, 19 Aug 2023 at 03:02, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Inspired by Ard Biesheuvel's RFC patches [1] for accelerating
> carry-less multiply under emulation.
>
> Changes for v2:
>   * Only accelerate clmul_64; keep generic helpers for other sizes.
>   * Drop most of the Int128 interfaces, except for clmul_64.
>   * Use the same acceleration format as aes-round.h.
>
> r~
>
> [1] https://patchew.org/QEMU/20230601123332.3297404-1-ardb@kernel.org/
>
> Richard Henderson (18):
>   crypto: Add generic 8-bit carry-less multiply routines
>   target/arm: Use clmul_8* routines
>   target/s390x: Use clmul_8* routines
>   target/ppc: Use clmul_8* routines
>   crypto: Add generic 16-bit carry-less multiply routines
>   target/arm: Use clmul_16* routines
>   target/s390x: Use clmul_16* routines
>   target/ppc: Use clmul_16* routines
>   crypto: Add generic 32-bit carry-less multiply routines
>   target/arm: Use clmul_32* routines
>   target/s390x: Use clmul_32* routines
>   target/ppc: Use clmul_32* routines
>   crypto: Add generic 64-bit carry-less multiply routine
>   target/arm: Use clmul_64
>   target/s390x: Use clmul_64
>   target/ppc: Use clmul_64
>   host/include/i386: Implement clmul.h
>   host/include/aarch64: Implement clmul.h
>

I didn't re-run the OpenSSL benchmark, but the x86 Linux kernel still
passes all its crypto selftests when running under TCG emulation on a
TX2 arm64 host, so

Tested-by: Ard Biesheuvel <ardb@kernel.org>

for the series.

Thanks,
Ard.
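For readers new to the operation: a carry-less multiply is a schoolbook
multiply whose partial products are combined with XOR instead of ADD, so no
carries propagate between bit positions; this is polynomial multiplication
over GF(2), the primitive behind GHASH and the x86 PCLMULQDQ instruction.
Below is a minimal sketch of the 64 x 64 -> 128-bit case using QEMU's
Int128 helpers; the function name is illustrative and this is not
necessarily the exact generic routine the series adds.

#include "qemu/int128.h"

/*
 * Illustrative generic carry-less multiply, 64 x 64 -> 128 bits:
 * for each set bit i of n, XOR the 128-bit value (m << i) into the
 * accumulator.  XOR replaces ADD, so no carries propagate.
 */
static Int128 clmul_64_generic(uint64_t n, uint64_t m)
{
    uint64_t rl = 0, rh = 0;

    for (int i = 0; i < 64; i++) {
        if ((n >> i) & 1) {
            rl ^= m << i;
            rh ^= i ? m >> (64 - i) : 0;   /* avoid shifting by 64 */
        }
    }
    return int128_make128(rl, rh);
}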
On 8/21/23 07:57, Ard Biesheuvel wrote:
>> Richard Henderson (18):
>>   crypto: Add generic 8-bit carry-less multiply routines
>>   target/arm: Use clmul_8* routines
>>   target/s390x: Use clmul_8* routines
>>   target/ppc: Use clmul_8* routines
>>   crypto: Add generic 16-bit carry-less multiply routines
>>   target/arm: Use clmul_16* routines
>>   target/s390x: Use clmul_16* routines
>>   target/ppc: Use clmul_16* routines
>>   crypto: Add generic 32-bit carry-less multiply routines
>>   target/arm: Use clmul_32* routines
>>   target/s390x: Use clmul_32* routines
>>   target/ppc: Use clmul_32* routines
>>   crypto: Add generic 64-bit carry-less multiply routine
>>   target/arm: Use clmul_64
>>   target/s390x: Use clmul_64
>>   target/ppc: Use clmul_64
>>   host/include/i386: Implement clmul.h
>>   host/include/aarch64: Implement clmul.h
>>
>
> I didn't re-run the OpenSSL benchmark, but the x86 Linux kernel still
> passes all its crypto selftests when running under TCG emulation on a
> TX2 arm64 host, so
>
> Tested-by: Ard Biesheuvel <ardb@kernel.org>

Oh, whoops. What's missing here? Any target/i386 changes.

r~
On Mon, 21 Aug 2023 at 17:15, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 8/21/23 07:57, Ard Biesheuvel wrote:
> >> Richard Henderson (18):
> >>   crypto: Add generic 8-bit carry-less multiply routines
> >>   target/arm: Use clmul_8* routines
> >>   target/s390x: Use clmul_8* routines
> >>   target/ppc: Use clmul_8* routines
> >>   crypto: Add generic 16-bit carry-less multiply routines
> >>   target/arm: Use clmul_16* routines
> >>   target/s390x: Use clmul_16* routines
> >>   target/ppc: Use clmul_16* routines
> >>   crypto: Add generic 32-bit carry-less multiply routines
> >>   target/arm: Use clmul_32* routines
> >>   target/s390x: Use clmul_32* routines
> >>   target/ppc: Use clmul_32* routines
> >>   crypto: Add generic 64-bit carry-less multiply routine
> >>   target/arm: Use clmul_64
> >>   target/s390x: Use clmul_64
> >>   target/ppc: Use clmul_64
> >>   host/include/i386: Implement clmul.h
> >>   host/include/aarch64: Implement clmul.h
> >>
> >
> > I didn't re-run the OpenSSL benchmark, but the x86 Linux kernel still
> > passes all its crypto selftests when running under TCG emulation on a
> > TX2 arm64 host, so
> >
> > Tested-by: Ard Biesheuvel <ardb@kernel.org>
>
> Oh, whoops. What's missing here? Any target/i386 changes.
>

Ah yes - I hadn't spotted that. The below seems to do the trick.

--- a/target/i386/ops_sse.h
+++ b/target/i386/ops_sse.h
@@ -2156,7 +2156,10 @@ void glue(helper_pclmulqdq, SUFFIX)(CPUX86State *env, Reg *d, Reg *v, Reg *s,
     for (i = 0; i < 1 << SHIFT; i += 2) {
         a = v->Q(((ctrl & 1) != 0) + i);
         b = s->Q(((ctrl & 16) != 0) + i);
-        clmulq(&d->Q(i), &d->Q(i + 1), a, b);
+
+        Int128 r = clmul_64(a, b);
+        d->Q(i) = int128_getlo(r);
+        d->Q(i + 1) = int128_gethi(r);
     }
 }

[and the #include added and clmulq() dropped]

I did a quick RFC4106 benchmark with tcrypt (which doesn't speed up as
much as OpenSSL, but it is a bit of a hassle cross-rebuilding that):

no acceleration:
tcrypt: test 7 (160 bit key, 8192 byte blocks): 1547 operations in 1 seconds (12673024 bytes)

AES only:
tcrypt: test 7 (160 bit key, 8192 byte blocks): 1679 operations in 1 seconds (13754368 bytes)

AES and PMULL:
tcrypt: test 7 (160 bit key, 8192 byte blocks): 3298 operations in 1 seconds (27017216 bytes)
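For context on the last measurement: on an aarch64 host the entire
64 x 64 -> 128-bit carry-less multiply maps onto a single PMULL
instruction. Below is a minimal sketch of what such a host hook can look
like, assuming the vmull_p64 intrinsic from <arm_neon.h> and a compiler
building for the crypto extension; the naming here is illustrative, and
the series' actual host/include/aarch64 header may differ in detail.

#include <arm_neon.h>    /* vmull_p64; needs the +crypto target feature */
#include "qemu/int128.h"

/*
 * 64 x 64 -> 128-bit carry-less multiply in one PMULL instruction.
 * poly128_t and Int128 are both 128-bit values, so a union suffices
 * to convert the result.
 */
static inline Int128 clmul_64_pmull(uint64_t n, uint64_t m)
{
    union {
        poly128_t v;
        Int128 s;
    } u;

    u.v = vmull_p64((poly64_t)n, (poly64_t)m);
    return u.s;
}

A real host header would also gate this behind feature detection and fall
back to the generic routine on hosts without the instruction, following
the same acceleration format as aes-round.h noted in the cover letter.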