Message ID | 20221012215931.3896-10-elliott@hpe.com |
---|---|
State | New |
Series | crypto: x86 - fix RCU stalls |
On Thu, Oct 13, 2022 at 3:48 PM Elliott, Robert (Servers) <elliott@hpe.com> wrote:
> Perhaps we should declare a time goal like "30 us," measure the actual
> speed of each algorithm with a tcrypt speed test, and calculate the
> nominal value assuming some slow x86 CPU core speed?

Sure, pick something reasonable with good margin for a reasonable CPU.
It doesn't have to be perfect, but just vaguely right for supported
hardware.

> That could be further adjusted at run-time based on the supposed
> minimum CPU frequency (e.g., as reported in
> /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq).

Oh no, please no. Not another runtime knob. That also will make the
loop less efficient.
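For illustration, a back-of-the-envelope way to turn such a time goal into
a byte limit might look like the following (a minimal sketch only;
FPU_BUDGET_US, SLOW_CPU_MHZ, and fpu_bytes_limit are hypothetical names and
numbers, not anything proposed in this series):

/*
 * Sketch: derive a per-kernel_fpu_begin/end byte limit from a latency
 * budget, using a worst-case cycles-per-byte figure from a tcrypt
 * speed test and a pessimistic core clock.
 */
#define FPU_BUDGET_US   30U     /* assumed time goal for one FPU section */
#define SLOW_CPU_MHZ    1000U   /* assumed slowest supported core clock */

static unsigned int fpu_bytes_limit(unsigned int worst_cycles_per_byte)
{
        unsigned int budget_cycles = FPU_BUDGET_US * SLOW_CPU_MHZ;

        /* e.g. ~5 cycles/byte on the slow core gives about 6000 bytes */
        return budget_cycles / worst_cycles_per_byte;
}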
> -----Original Message-----
> From: Jason A. Donenfeld <Jason@zx2c4.com>
> Sent: Thursday, October 13, 2022 8:27 PM
> Subject: Re: [PATCH v2 09/19] crypto: x86 - use common macro for FPU limit
>
> On Thu, Oct 13, 2022 at 3:48 PM Elliott, Robert (Servers)
> <elliott@hpe.com> wrote:
> > Perhaps we should declare a time goal like "30 us," measure the actual
> > speed of each algorithm with a tcrypt speed test, and calculate the
> > nominal value assuming some slow x86 CPU core speed?
>
> Sure, pick something reasonable with good margin for a reasonable CPU.
> It doesn't have to be perfect, but just vaguely right for supported
> hardware.
>
> > That could be further adjusted at run-time based on the supposed
> > minimum CPU frequency (e.g., as reported in
> > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq).
>
> Oh no, please no. Not another runtime knob. That also will make the
> loop less efficient.

Here are some stats measuring the time in CPU cycles between
kernel_fpu_begin() and kernel_fpu_end() for every x86 crypto module using
those function calls. This is before any patches to enforce any new limits.

Driver                         boot     tcrypt-sweep   average
======                         ====     ============   =======
aegis128_aesni                 6240   |         8214       433
aesni_intel                   22218   |       150558        68
aria_aesni_avx_x86_64             0   >        95560      1282
camellia_aesni_avx2           52300            52300      4300
camellia_aesni_avx_x86_64     20920            20920      5915
camellia_x86_64                   0                0         0
cast5_avx_x86_64              41854   |       108996      6602
cast6_avx_x86_64              39270   |       119476     10596
chacha_x86_64                  3516   |        58112       349
crc32c_intel                   1458   |         2702       235
crc32_pclmul                   1610   |         3130       210
crct10dif_pclmul               1928   |         2096        82
ghash_clmulni_intel            9154   |        56632       336
libblake2s_x86_64              7514             7514       897
nhpoly1305_avx2                1360   |         5408       301
poly1305_x86_64               20656   |        21688       409
polyval_clmulni               13972            13972        34
serpent_avx2                  45686   |        74824      4185
serpent_avx_x86_64            47436            47436      7120
serpent_sse2_x86_64           38492            38492      7400
sha1_ssse3                    20950   |        38310       512
sha256_ssse3                  46554   |        57162      1201
sha512_ssse3              157051800        157051800    167728
sm3_avx_x86_64                82372            82372      2017
sm4_aesni_avx_x86_64          66350            66350      2019
twofish_avx_x86_64           104598   |       163894      6633
twofish_x86_64_3way               0                0         0

Comparing a few of the hash functions with tcrypt test 16 (4 KiB of data
with 1 update) shows a 35x difference from the fastest to the slowest:

crc32c          695 cycles/operation
crct10dif      2197
sha1-avx2      8825
sha224-avx2   24816
sha256-avx2   21179
sha384-avx2   14939
sha512-avx2   14584

Test notes
==========
Measurement points:
- after booting, with
  - CONFIG_MODULE_SIG_SHA512=y (use SHA-512 for module signing)
  - CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y (compares results with the generic
    module during init)
  - # CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set (run self-tests
    during module load)
- after sweeping through tcrypt test modes 1 to 999
  - except 0, 300, and 400, which run combinations of the others
- measured on a system with Intel Cascade Lake CPUs at 2.2 GHz

This run did not report any RCU stalls.

The hash functions are the main problem, subjected to huge sizes during
module signature checking. sha1 or sha256 would face the same problem if
they had been selected.

The self-tests are limited to 2 * PAGE_SIZE, so they don't stress the
drivers anywhere near as much as booting does.

This run did include the tcrypt patch to call cond_resched during speed
tests, so the speed-test-induced problem is out of the way.

aria_aesni_avx_x86_64             0   >        95560      1282

This run didn't have the patch to load aria based on the device table, so
it wasn't loaded until tcrypt asked for it.
camellia_x86_64                   0                0         0
twofish_x86_64_3way               0                0         0

Those use the ecb_cbc_helper macros, but pass along -1 to not use
kernel_fpu_begin/end, so the debug instrumentation is there but unused.

Next steps
==========
I'll try to add a test with long data, and work on scaling the loops based
on relative performance (e.g., if sha512 needs 4 KiB, then crc32c should be
fine with 80 KiB).
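For reference, a minimal sketch of how such per-driver cycle counts could
be gathered (the wrapper names and the per-CPU variable below are
hypothetical, not the instrumentation actually used for the numbers above):

#include <linux/percpu.h>
#include <asm/fpu/api.h>
#include <asm/timex.h>

static DEFINE_PER_CPU(unsigned long long, fpu_max_cycles);

static inline unsigned long long fpu_begin_timed(void)
{
        kernel_fpu_begin();
        return get_cycles();
}

static inline void fpu_end_timed(unsigned long long start)
{
        unsigned long long elapsed = get_cycles() - start;

        /* preemption is still disabled here, so this_cpu ops are safe */
        if (elapsed > __this_cpu_read(fpu_max_cycles))
                __this_cpu_write(fpu_max_cycles, elapsed);

        kernel_fpu_end();
}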
diff --git a/arch/x86/crypto/blake2s-glue.c b/arch/x86/crypto/blake2s-glue.c
index aaba21230528..3054ee7fa219 100644
--- a/arch/x86/crypto/blake2s-glue.c
+++ b/arch/x86/crypto/blake2s-glue.c
@@ -16,6 +16,8 @@
 #include <asm/processor.h>
 #include <asm/simd.h>
 
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage void blake2s_compress_ssse3(struct blake2s_state *state,
                                        const u8 *block, const size_t nblocks,
                                        const u32 inc);
@@ -29,8 +31,7 @@ static __ro_after_init DEFINE_STATIC_KEY_FALSE(blake2s_use_avx512);
 void blake2s_compress(struct blake2s_state *state, const u8 *block,
                       size_t nblocks, const u32 inc)
 {
-       /* SIMD disables preemption, so relax after processing each page. */
-       BUILD_BUG_ON(SZ_4K / BLAKE2S_BLOCK_SIZE < 8);
+       BUILD_BUG_ON(FPU_BYTES / BLAKE2S_BLOCK_SIZE < 8);
 
        if (!static_branch_likely(&blake2s_use_ssse3) || !may_use_simd()) {
                blake2s_compress_generic(state, block, nblocks, inc);
@@ -39,7 +40,7 @@ void blake2s_compress(struct blake2s_state *state, const u8 *block,
 
        do {
                const size_t blocks = min_t(size_t, nblocks,
-                                           SZ_4K / BLAKE2S_BLOCK_SIZE);
+                                           FPU_BYTES / BLAKE2S_BLOCK_SIZE);
 
                kernel_fpu_begin();
                if (IS_ENABLED(CONFIG_AS_AVX512) &&
diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
index 7b3a1cf0984b..0d7e172862db 100644
--- a/arch/x86/crypto/chacha_glue.c
+++ b/arch/x86/crypto/chacha_glue.c
@@ -15,6 +15,8 @@
 #include <linux/sizes.h>
 #include <asm/simd.h>
 
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage void chacha_block_xor_ssse3(u32 *state, u8 *dst, const u8 *src,
                                        unsigned int len, int nrounds);
 asmlinkage void chacha_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src,
@@ -147,7 +149,7 @@ void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src, unsigned int bytes,
                return chacha_crypt_generic(state, dst, src, bytes, nrounds);
 
        do {
-               unsigned int todo = min_t(unsigned int, bytes, SZ_4K);
+               unsigned int todo = min(bytes, FPU_BYTES);
 
                kernel_fpu_begin();
                chacha_dosimd(state, dst, src, todo, nrounds);
diff --git a/arch/x86/crypto/nhpoly1305-avx2-glue.c b/arch/x86/crypto/nhpoly1305-avx2-glue.c
index 8ea5ab0f1ca7..59615ae95e86 100644
--- a/arch/x86/crypto/nhpoly1305-avx2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-avx2-glue.c
@@ -13,6 +13,8 @@
 #include <linux/sizes.h>
 #include <asm/simd.h>
 
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage void nh_avx2(const u32 *key, const u8 *message, size_t message_len,
                        u8 hash[NH_HASH_BYTES]);
 
@@ -30,7 +32,7 @@ static int nhpoly1305_avx2_update(struct shash_desc *desc,
                return crypto_nhpoly1305_update(desc, src, srclen);
 
        do {
-               unsigned int n = min_t(unsigned int, srclen, SZ_4K);
+               unsigned int n = min(srclen, FPU_BYTES);
 
                kernel_fpu_begin();
                crypto_nhpoly1305_update_helper(desc, src, n, _nh_avx2);
diff --git a/arch/x86/crypto/nhpoly1305-sse2-glue.c b/arch/x86/crypto/nhpoly1305-sse2-glue.c
index 2b353d42ed13..bf91c375821a 100644
--- a/arch/x86/crypto/nhpoly1305-sse2-glue.c
+++ b/arch/x86/crypto/nhpoly1305-sse2-glue.c
@@ -13,6 +13,8 @@
 #include <linux/sizes.h>
 #include <asm/simd.h>
 
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage void nh_sse2(const u32 *key, const u8 *message, size_t message_len,
                        u8 hash[NH_HASH_BYTES]);
 
@@ -30,7 +32,7 @@ static int nhpoly1305_sse2_update(struct shash_desc *desc,
                return crypto_nhpoly1305_update(desc, src, srclen);
 
        do {
-               unsigned int n = min_t(unsigned int, srclen, SZ_4K);
+               unsigned int n = min(srclen, FPU_BYTES);
 
                kernel_fpu_begin();
                crypto_nhpoly1305_update_helper(desc, src, n, _nh_sse2);
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index 1dfb8af48a3c..3764301bdf1b 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -15,20 +15,24 @@
 #include <asm/intel-family.h>
 #include <asm/simd.h>
 
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 asmlinkage void poly1305_init_x86_64(void *ctx,
                                     const u8 key[POLY1305_BLOCK_SIZE]);
 asmlinkage void poly1305_blocks_x86_64(void *ctx, const u8 *inp,
-                                      const size_t len, const u32 padbit);
+                                      const unsigned int len,
+                                      const u32 padbit);
 asmlinkage void poly1305_emit_x86_64(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
                                     const u32 nonce[4]);
 asmlinkage void poly1305_emit_avx(void *ctx, u8 mac[POLY1305_DIGEST_SIZE],
                                  const u32 nonce[4]);
-asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp, const size_t len,
-                                   const u32 padbit);
-asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp, const size_t len,
-                                    const u32 padbit);
+asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp,
+                                   const unsigned int len, const u32 padbit);
+asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp,
+                                    const unsigned int len, const u32 padbit);
 asmlinkage void poly1305_blocks_avx512(void *ctx, const u8 *inp,
-                                      const size_t len, const u32 padbit);
+                                      const unsigned int len,
+                                      const u32 padbit);
 
 static __ro_after_init DEFINE_STATIC_KEY_FALSE(poly1305_use_avx);
 static __ro_after_init DEFINE_STATIC_KEY_FALSE(poly1305_use_avx2);
@@ -86,14 +90,13 @@ static void poly1305_simd_init(void *ctx, const u8 key[POLY1305_BLOCK_SIZE])
                poly1305_init_x86_64(ctx, key);
 }
 
-static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
+static void poly1305_simd_blocks(void *ctx, const u8 *inp, unsigned int len,
                                 const u32 padbit)
 {
        struct poly1305_arch_internal *state = ctx;
 
-       /* SIMD disables preemption, so relax after processing each page. */
-       BUILD_BUG_ON(SZ_4K < POLY1305_BLOCK_SIZE ||
-                    SZ_4K % POLY1305_BLOCK_SIZE);
+       BUILD_BUG_ON(FPU_BYTES < POLY1305_BLOCK_SIZE ||
+                    FPU_BYTES % POLY1305_BLOCK_SIZE);
 
        if (!static_branch_likely(&poly1305_use_avx) ||
            (len < (POLY1305_BLOCK_SIZE * 18) && !state->is_base2_26) ||
@@ -104,7 +107,7 @@ static void poly1305_simd_blocks(void *ctx, const u8 *inp, size_t len,
        }
 
        do {
-               const size_t bytes = min_t(size_t, len, SZ_4K);
+               const unsigned int bytes = min(len, FPU_BYTES);
 
                kernel_fpu_begin();
                if (IS_ENABLED(CONFIG_AS_AVX512) && static_branch_likely(&poly1305_use_avx512))
diff --git a/arch/x86/crypto/polyval-clmulni_glue.c b/arch/x86/crypto/polyval-clmulni_glue.c
index b7664d018851..2502964afef6 100644
--- a/arch/x86/crypto/polyval-clmulni_glue.c
+++ b/arch/x86/crypto/polyval-clmulni_glue.c
@@ -29,6 +29,8 @@
 
 #define NUM_KEY_POWERS 8
 
+#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */
+
 struct polyval_tfm_ctx {
        /*
         * These powers must be in the order h^8, ..., h^1.
@@ -123,8 +125,7 @@ static int polyval_x86_update(struct shash_desc *desc,
        }
 
        while (srclen >= POLYVAL_BLOCK_SIZE) {
-               /* Allow rescheduling every 4K bytes. */
-               nblocks = min(srclen, 4096U) / POLYVAL_BLOCK_SIZE;
+               nblocks = min(srclen, FPU_BYTES) / POLYVAL_BLOCK_SIZE;
                internal_polyval_update(tctx, src, nblocks, dctx->buffer);
                srclen -= nblocks * POLYVAL_BLOCK_SIZE;
                src += nblocks * POLYVAL_BLOCK_SIZE;
Use a common macro name (FPU_BYTES) for the limit of the number of bytes
processed within kernel_fpu_begin and kernel_fpu_end rather than using
SZ_4K (which is a signed value), or a magic value of 4096U.

Use unsigned int rather than size_t for some of the arguments to avoid
typecasting for the min() macro.

Signed-off-by: Robert Elliott <elliott@hpe.com>
---
 arch/x86/crypto/blake2s-glue.c         |  7 ++++---
 arch/x86/crypto/chacha_glue.c          |  4 +++-
 arch/x86/crypto/nhpoly1305-avx2-glue.c |  4 +++-
 arch/x86/crypto/nhpoly1305-sse2-glue.c |  4 +++-
 arch/x86/crypto/poly1305_glue.c        | 25 ++++++++++++++-----------
 arch/x86/crypto/polyval-clmulni_glue.c |  5 +++--
 6 files changed, 30 insertions(+), 19 deletions(-)
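The loop shape these conversions converge on looks roughly like the
following (a sketch only; example_simd_update and example_blocks_simd are
hypothetical names, not functions from any of the drivers touched here):

#define FPU_BYTES 4096U /* avoid kernel_fpu_begin/end scheduler/rcu stalls */

static void example_simd_update(const u8 *src, unsigned int len)
{
        do {
                /* both operands are unsigned int, so plain min() works */
                unsigned int todo = min(len, FPU_BYTES);

                kernel_fpu_begin();
                example_blocks_simd(src, todo);  /* hypothetical asm helper */
                kernel_fpu_end();

                len -= todo;
                src += todo;
        } while (len);
}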