Message ID | 20230821161854.419893-1-richard.henderson@linaro.org |
---|---|
Series | crypto: Provide clmul.h and host accel |
On Mon, 21 Aug 2023 at 18:18, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Inspired by Ard Biesheuvel's RFC patches [1] for accelerating
> carry-less multiply under emulation.
>
> Changes for v3:
>   * Update target/i386 ops_sse.h.
>   * Apply r-b.
>
> Changes for v2:
>   * Only accelerate clmul_64; keep generic helpers for other sizes.
>   * Drop most of the Int128 interfaces, except for clmul_64.
>   * Use the same acceleration format as aes-round.h.
>
>
> r~
>
>
> [1] https://patchew.org/QEMU/20230601123332.3297404-1-ardb@kernel.org/
>
>
> Richard Henderson (19):
>   crypto: Add generic 8-bit carry-less multiply routines
>   target/arm: Use clmul_8* routines
>   target/s390x: Use clmul_8* routines
>   target/ppc: Use clmul_8* routines
>   crypto: Add generic 16-bit carry-less multiply routines
>   target/arm: Use clmul_16* routines
>   target/s390x: Use clmul_16* routines
>   target/ppc: Use clmul_16* routines
>   crypto: Add generic 32-bit carry-less multiply routines
>   target/arm: Use clmul_32* routines
>   target/s390x: Use clmul_32* routines
>   target/ppc: Use clmul_32* routines
>   crypto: Add generic 64-bit carry-less multiply routine
>   target/arm: Use clmul_64
>   target/i386: Use clmul_64
>   target/s390x: Use clmul_64
>   target/ppc: Use clmul_64
>   host/include/i386: Implement clmul.h
>   host/include/aarch64: Implement clmul.h
>

OK, I did the OpenSSL benchmark this time, using a x86_64 cross build
on arm64/ThunderX2, and the speedup is 7x (\o/)

Tested-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Ard Biesheuvel <ardb@kernel.org>


Distro qemu (no acceleration):

$ qemu-x86_64 --version
qemu-x86_64 version 7.2.4 (Debian 1:7.2+dfsg-7+deb12u1)

$ apps/openssl speed -evp aes-128-gcm
version: 3.2.0-dev
built on: Mon Aug 21 17:57:37 2023 UTC
options: bn(64,64)
compiler: x86_64-linux-gnu-gcc -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: OPENSSL_ia32cap=0xfed8320b0fcbfffd:0x8001020c01d843a9
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
AES-128-GCM      8856.13k    13820.95k    17375.49k    16826.37k    16870.06k    17208.66k


QEMU built with this series applied onto latest master:

$ ~/build/qemu/build/qemu-x86_64 apps/openssl speed -evp aes-128-gcm
version: 3.2.0-dev
built on: Mon Aug 21 17:57:37 2023 UTC
options: bn(64,64)
compiler: x86_64-linux-gnu-gcc -pthread -m64 -Wa,--noexecstack -Wall -O3 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_BUILDING_OPENSSL -DNDEBUG
CPUINFO: OPENSSL_ia32cap=0xfffa320b0fcbfffd:0x8041020c01dc47a9
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
AES-128-GCM     14237.01k    34176.34k    70633.13k    97372.84k   119668.74k   122049.88k
On 8/21/23 11:08, Ard Biesheuvel wrote:
> OK, I did the OpenSSL benchmark this time, using a x86_64 cross build
> on arm64/ThunderX2, and the speedup is 7x (\o/)

Excellent, thanks.

r~
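For reference, the per-block-size ratios behind the 7x figure can be recomputed from the two openssl tables quoted earlier in the thread. The following is only a small sketch in C: the struct, array and function names are made up for illustration, and the numbers are copied from the benchmark output above.

```c
#include <stddef.h>

/*
 * One column of the "openssl speed -evp aes-128-gcm" tables.
 * Throughput is in 1000s of bytes per second, as openssl reports it.
 * All names here are hypothetical, chosen for this illustration.
 */
struct gcm_row {
    unsigned block_bytes;   /* block size of the column */
    double distro_kbps;     /* distro qemu 7.2.4, no acceleration */
    double patched_kbps;    /* qemu built with this series applied */
};

const struct gcm_row gcm_rows[] = {
    {    16,  8856.13,  14237.01 },
    {    64, 13820.95,  34176.34 },
    {   256, 17375.49,  70633.13 },
    {  1024, 16826.37,  97372.84 },
    {  8192, 16870.06, 119668.74 },
    { 16384, 17208.66, 122049.88 },
};

const size_t gcm_row_count = sizeof(gcm_rows) / sizeof(gcm_rows[0]);

/* Speedup of the patched build over the distro build for one column. */
double gcm_speedup(const struct gcm_row *row)
{
    return row->patched_kbps / row->distro_kbps;
}
```

With these figures the ratio is roughly 1.6x at 16 bytes and just over 7x at 8192 and 16384 bytes, so the quoted 7x describes the large-block case.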
Ping.  Still missing r-b on patches 1, 4, 5, 8, 9, 12, 13, 18.

r~

On 8/21/23 09:18, Richard Henderson wrote:
> Inspired by Ard Biesheuvel's RFC patches [1] for accelerating
> carry-less multiply under emulation.
>
> Changes for v3:
>   * Update target/i386 ops_sse.h.
>   * Apply r-b.
>
> Changes for v2:
>   * Only accelerate clmul_64; keep generic helpers for other sizes.
>   * Drop most of the Int128 interfaces, except for clmul_64.
>   * Use the same acceleration format as aes-round.h.
>
>
> r~
>
>
> [1] https://patchew.org/QEMU/20230601123332.3297404-1-ardb@kernel.org/
>
>
> Richard Henderson (19):
>   crypto: Add generic 8-bit carry-less multiply routines
>   target/arm: Use clmul_8* routines
>   target/s390x: Use clmul_8* routines
>   target/ppc: Use clmul_8* routines
>   crypto: Add generic 16-bit carry-less multiply routines
>   target/arm: Use clmul_16* routines
>   target/s390x: Use clmul_16* routines
>   target/ppc: Use clmul_16* routines
>   crypto: Add generic 32-bit carry-less multiply routines
>   target/arm: Use clmul_32* routines
>   target/s390x: Use clmul_32* routines
>   target/ppc: Use clmul_32* routines
>   crypto: Add generic 64-bit carry-less multiply routine
>   target/arm: Use clmul_64
>   target/i386: Use clmul_64
>   target/s390x: Use clmul_64
>   target/ppc: Use clmul_64
>   host/include/i386: Implement clmul.h
>   host/include/aarch64: Implement clmul.h
>
>  host/include/aarch64/host/cpuinfo.h      |   1 +
>  host/include/aarch64/host/crypto/clmul.h |  41 +++++
>  host/include/generic/host/crypto/clmul.h |  15 ++
>  host/include/i386/host/cpuinfo.h         |   1 +
>  host/include/i386/host/crypto/clmul.h    |  29 ++++
>  host/include/x86_64/host/crypto/clmul.h  |   1 +
>  include/crypto/clmul.h                   |  83 ++++++++++
>  include/qemu/cpuid.h                     |   3 +
>  target/arm/tcg/vec_internal.h            |  11 --
>  target/i386/ops_sse.h                    |  40 ++---
>  crypto/clmul.c                           | 112 ++++++++++++++
>  target/arm/tcg/mve_helper.c              |  16 +-
>  target/arm/tcg/vec_helper.c              | 102 ++-----------
>  target/ppc/int_helper.c                  |  64 ++++----
>  target/s390x/tcg/vec_int_helper.c        | 186 ++++++++++-------------
>  util/cpuinfo-aarch64.c                   |   4 +-
>  util/cpuinfo-i386.c                      |   1 +
>  crypto/meson.build                       |   9 +-
>  18 files changed, 434 insertions(+), 285 deletions(-)
>  create mode 100644 host/include/aarch64/host/crypto/clmul.h
>  create mode 100644 host/include/generic/host/crypto/clmul.h
>  create mode 100644 host/include/i386/host/crypto/clmul.h
>  create mode 100644 host/include/x86_64/host/crypto/clmul.h
>  create mode 100644 include/crypto/clmul.h
>  create mode 100644 crypto/clmul.c
>
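For readers unfamiliar with the operation this series accelerates: a carry-less multiply is ordinary shift-and-add multiplication with XOR in place of addition, so a 64x64-bit product needs 128 bits. Below is a minimal bit-by-bit sketch in C. It is not the code from crypto/clmul.c in this series; the type u128_sketch (a stand-in for a 128-bit result such as Int128) and the function name clmul_64_sketch are assumptions made for illustration.

```c
#include <stdint.h>

/* 128-bit result as two 64-bit halves (hypothetical stand-in type). */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_sketch;

/*
 * Carry-less (polynomial over GF(2)) multiply of two 64-bit values.
 * For every set bit i of b, XOR (a << i) into the 128-bit result.
 */
u128_sketch clmul_64_sketch(uint64_t a, uint64_t b)
{
    u128_sketch r = { 0, 0 };

    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            /*
             * Bits of a that were shifted out of the low half land in
             * the high half.  Skip i == 0: a shift by 64 is undefined
             * in C, and nothing is shifted out in that case anyway.
             */
            if (i != 0) {
                r.hi ^= a >> (64 - i);
            }
        }
    }
    return r;
}
```

As a quick check, clmul_64_sketch(3, 3) gives lo = 5, hi = 0, because (x + 1)^2 = x^2 + 1 when coefficients are added modulo 2. A loop like this costs up to 64 iterations per product, which is the kind of work a single host carry-less multiply instruction can replace.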