[RFC,v2,0/7] crypto: HCTR2 support

Message ID	20220210232812.798387-1-nhuck@google.com
Headers	show Return-Path: <linux-crypto-owner@kernel.org> Date: Thu, 10 Feb 2022 23:28:05 +0000 Message-Id: <20220210232812.798387-1-nhuck@google.com> Mime-Version: 1.0 Subject: [RFC PATCH v2 0/7] crypto: HCTR2 support From: Nathan Huckleberry <nhuck@google.com> To: linux-crypto@vger.kernel.org Cc: Herbert Xu <herbert@gondor.apana.org.au>, "David S. Miller" <davem@davemloft.net>, linux-arm-kernel@lists.infradead.org, Paul Crowley <paulcrowley@google.com>, Eric Biggers <ebiggers@kernel.org>, Sami Tolvanen <samitolvanen@google.com>, Ard Biesheuvel <ardb@kernel.org>, Nathan Huckleberry <nhuck@google.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk
Series	crypto: HCTR2 support \| expand [RFC,v2,0/7] crypto: HCTR2 support [RFC,v2,2/7] crypto: polyval - Add POLYVAL support [RFC,v2,5/7] crypto: arm64/aes-xctr: Add accelerated implementation of XCTR [RFC,v2,7/7] crypto: arm64/polyval: Add PMULL accelerated implementation of POLYVAL

Message ID

20220210232812.798387-1-nhuck@google.com

Headers

Date: Thu, 10 Feb 2022 23:28:05 +0000
Message-Id: <20220210232812.798387-1-nhuck@google.com>
Mime-Version: 1.0
Subject: [RFC PATCH v2 0/7] crypto: HCTR2 support
From: Nathan Huckleberry <nhuck@google.com>
To: linux-crypto@vger.kernel.org
Cc: Herbert Xu <herbert@gondor.apana.org.au>,
        "David S. Miller" <davem@davemloft.net>,
        linux-arm-kernel@lists.infradead.org,
        Paul Crowley <paulcrowley@google.com>,
        Eric Biggers <ebiggers@kernel.org>,
        Sami Tolvanen <samitolvanen@google.com>,
        Ard Biesheuvel <ardb@kernel.org>,
        Nathan Huckleberry <nhuck@google.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk

Series

crypto: HCTR2 support | expand

Message

Nathan Huckleberry Feb. 10, 2022, 11:28 p.m. UTC

HCTR2 is a length-preserving encryption mode that is efficient on
processors with instructions to accelerate AES and carryless
multiplication, e.g. x86 processors with AES-NI and CLMUL, and ARM
processors with the ARMv8 Crypto Extensions.

HCTR2 is specified in https://ia.cr/2021/1441 "Length-preserving
encryption with HCTR2" which shows that if AES is secure and HCTR2 is
instantiated with AES, then HCTR2 is secure.  Reference code and test
vectors are at https://github.com/google/hctr2.

As a length-preserving encryption mode, HCTR2 is suitable for applications
such as storage encryption where ciphertext expansion is not possible, and
thus authenticated encryption cannot be used.  Currently, such
applications usually use XTS, or in some cases Adiantum.  XTS has the
disadvantage that it is a narrow-block mode: a bitflip will only change 16
bytes in the resulting ciphertext or plaintext.  This reveals more
information to an attacker than necessary.

HCTR2 is a wide-block mode, so it provides a stronger security property: a
bitflip will change the entire message.  HCTR2 is somewhat similar to
Adiantum, which is also a wide-block mode.  However, HCTR2 is designed to
take advantage of existing crypto instructions, while Adiantum targets
devices without such hardware support.  Adiantum is also designed with
longer messages in mind, while HCTR2 is designed to be efficient even on
short messages.

The first intended use of this mode in the kernel is for the encryption of
filenames, where for efficiency reasons encryption must be fully
deterministic (only one ciphertext for each plaintext) and the existing
CBC solution leaks more information than necessary for filenames with
common prefixes.

HCTR2 uses two passes of an ε-almost-∆-universal hash function called
POLYVAL and one pass of a block cipher mode called XCTR.  POLYVAL is a
polynomial hash designed for efficiency on modern processors and was
originally specified for use in AES-GCM-SIV (RFC 8452).  XCTR mode is a
variant of CTR mode that is more efficient on little-endian machines.

This patchset adds HCTR2 to Linux's crypto API, including generic
implementations of XCTR and POLYVAL, hardware accelerated implementations
of XCTR and POLYVAL for both x86-64 and ARM64, and a templated
implementation of HCTR2.

Nathan Huckleberry (7):
  crypto: xctr - Add XCTR support
  crypto: polyval - Add POLYVAL support
  crypto: hctr2 - Add HCTR2 support
  crypto: x86/aesni-xctr: Add accelerated implementation of XCTR
  crypto: arm64/aes-xctr: Add accelerated implementation of XCTR
  crypto: x86/polyval: Add PCLMULQDQ accelerated implementation of
    POLYVAL
  crypto: arm64/polyval: Add PMULL accelerated implementation of POLYVAL

 arch/arm64/crypto/Kconfig                |   11 +-
 arch/arm64/crypto/Makefile               |    3 +
 arch/arm64/crypto/aes-glue.c             |   72 +-
 arch/arm64/crypto/aes-modes.S            |  130 ++
 arch/arm64/crypto/polyval-ce-core.S      |  405 ++++++
 arch/arm64/crypto/polyval-ce-glue.c      |  365 ++++++
 arch/x86/crypto/Makefile                 |    5 +-
 arch/x86/crypto/aes_xctrby8_avx-x86_64.S |  529 ++++++++
 arch/x86/crypto/aesni-intel_asm.S        |   70 +
 arch/x86/crypto/aesni-intel_glue.c       |   89 ++
 arch/x86/crypto/polyval-clmulni_asm.S    |  414 ++++++
 arch/x86/crypto/polyval-clmulni_glue.c   |  365 ++++++
 crypto/Kconfig                           |   38 +
 crypto/Makefile                          |    3 +
 crypto/hctr2.c                           |  532 ++++++++
 crypto/polyval-generic.c                 |  199 +++
 crypto/tcrypt.c                          |   10 +
 crypto/testmgr.c                         |   20 +
 crypto/testmgr.h                         | 1500 ++++++++++++++++++++++
 crypto/xctr.c                            |  193 +++
 include/crypto/polyval.h                 |   22 +
 21 files changed, 4970 insertions(+), 5 deletions(-)
 create mode 100644 arch/arm64/crypto/polyval-ce-core.S
 create mode 100644 arch/arm64/crypto/polyval-ce-glue.c
 create mode 100644 arch/x86/crypto/aes_xctrby8_avx-x86_64.S
 create mode 100644 arch/x86/crypto/polyval-clmulni_asm.S
 create mode 100644 arch/x86/crypto/polyval-clmulni_glue.c
 create mode 100644 crypto/hctr2.c
 create mode 100644 crypto/polyval-generic.c
 create mode 100644 crypto/xctr.c
 create mode 100644 include/crypto/polyval.h

Comments

Eric Biggers Feb. 16, 2022, 11 p.m. UTC | #1

On Thu, Feb 10, 2022 at 11:28:06PM +0000, Nathan Huckleberry wrote:
> Add a generic implementation of XCTR mode as a template.  XCTR is a
> blockcipher mode similar to CTR mode.  XCTR uses XORs and little-endian
> addition rather than big-endian arithmetic which has two advantages:  It
> is slightly faster on little-endian CPUs and it is less likely to be
> implemented incorrect since integer overflows are not possible on
> practical input sizes.  XCTR is used as a component to implement HCTR2.
> 
> More information on XCTR mode can be found in the HCTR2 paper:
> https://eprint.iacr.org/2021/1441.pdf
> 
> Signed-off-by: Nathan Huckleberry <nhuck@google.com>
> 
> Changes since v1:
>  * Restricted blocksize to 16-bytes
>  * Removed xctr.h and u32_to_le_block
>  * Use single crypto_template instead of array
> ---

Changelog text conventionally goes in the cover letter, not in the individual
patches.  Having the changelog be above the scissors line ("---") is especially
problematic, as it will show up in the git commit message.

> diff --git a/crypto/Kconfig b/crypto/Kconfig
> index fa1741bb568f..8543f34fa200 100644
> --- a/crypto/Kconfig
> +++ b/crypto/Kconfig
> @@ -452,6 +452,15 @@ config CRYPTO_PCBC
>  	  PCBC: Propagating Cipher Block Chaining mode
>  	  This block cipher algorithm is required for RxRPC.
>  
> +config CRYPTO_XCTR
> +	tristate
> +	select CRYPTO_SKCIPHER
> +	select CRYPTO_MANAGER
> +	help
> +	  XCTR: XOR Counter mode. This blockcipher mode is a variant of CTR mode
> +	  using XORs and little-endian addition rather than big-endian arithmetic.
> +	  XCTR mode is used to implement HCTR2.

Now that this option isn't user-selectable, no one will see this help text.
I think it would be best to remove it, and make sure that the comment in
crypto/xctr.c fully explains what XCTR is (currently it's a bit inadequate).

> +/*
> + * Test vectors generated using https://github.com/google/hctr2
> + */
> +static const struct cipher_testvec aes_xctr_tv_template[] = {
[...]
> +		.klen	= 16,
> +		.len	= 255,
[...]
> +		.klen	= 16,
> +		.len	= 255,

I commented on the test vectors before in the context of the HCTR2 ones, but the
same comments apply here: the actual test coverage for the number of test
vectors included is not great, due to lengths being repeated.  It would be
better to vary the lengths a bit more, especially the message lengths.  What you
have here isn't bad, but I think there's some room for improvement.

> +/*
> + * XCTR mode is a blockcipher mode of operation used to implement HCTR2. XCTR is
> + * closely related to the CTR mode of operation; the main difference is that CTR
> + * generates the keystream using E(CTR + IV) whereas XCTR generates the
> + * keystream using E(CTR ^ IV).
> + *
> + * See the HCTR2 paper for more details:
> + *	Length-preserving encryption with HCTR2
> + *      (https://eprint.iacr.org/2021/1441.pdf)
> + */

The above comment could use a bit more detail, e.g. mentioning endianness as
well as the fact that XCTR avoids having to deal with multi-limb integers.

> +static void crypto_xctr_crypt_final(struct skcipher_walk *walk,
> +				   struct crypto_cipher *tfm, u32 byte_ctr)
> +{
> +	unsigned long alignmask = crypto_cipher_alignmask(tfm);
> +	u8 tmp[XCTR_BLOCKSIZE + MAX_CIPHER_ALIGNMASK];
> +	u8 *keystream = PTR_ALIGN(tmp + 0, alignmask + 1);
> +	u8 *src = walk->src.virt.addr;
> +	u8 *dst = walk->dst.virt.addr;
> +	unsigned int nbytes = walk->nbytes;
> +	__le32 ctr32 = cpu_to_le32(byte_ctr / XCTR_BLOCKSIZE + 1);
> +
> +	crypto_xor(walk->iv, (u8 *)&ctr32, sizeof(ctr32));
> +	crypto_cipher_encrypt_one(tfm, keystream, walk->iv);
> +	crypto_xor_cpy(dst, keystream, src, nbytes);
> +	crypto_xor(walk->iv, (u8 *)&ctr32, sizeof(ctr32));
> +}

When crypto_cipher_encrypt_one() is used instead of ->cia_encrypt directly, the
caller doesn't need to align the buffers, sincec crypto_cipher_encrypt_one()
handles that.

> +static int crypto_xctr_crypt_segment(struct skcipher_walk *walk,
> +				    struct crypto_cipher *tfm, u32 byte_ctr)
> +{
> +	void (*fn)(struct crypto_tfm *, u8 *, const u8 *) =
> +		   crypto_cipher_alg(tfm)->cia_encrypt;
> +	u8 *src = walk->src.virt.addr;
> +	u8 *dst = walk->dst.virt.addr;
> +	unsigned int nbytes = walk->nbytes;
> +	__le32 ctr32 = cpu_to_le32(byte_ctr / XCTR_BLOCKSIZE + 1);
> +
> +	do {
> +		/* create keystream */
> +		crypto_xor(walk->iv, (u8 *)&ctr32, sizeof(ctr32));
> +		fn(crypto_cipher_tfm(tfm), dst, walk->iv);
> +		crypto_xor(dst, src, XCTR_BLOCKSIZE);
> +		crypto_xor(walk->iv, (u8 *)&ctr32, sizeof(ctr32));
> +
> +		ctr32++;
> +
> +		src += XCTR_BLOCKSIZE;
> +		dst += XCTR_BLOCKSIZE;
> +	} while ((nbytes -= XCTR_BLOCKSIZE) >= XCTR_BLOCKSIZE);
> +
> +	return nbytes;
> +}

This won't work on big endian systems due to the 'ctr32++' on a __le32 variable.
I recommend installing 'sparse' and passing C=2 to make, as it will warn about
endianness bugs like this.

Either endianness needs to be converted for the increment, or it needs to be
converted when doing the XOR.  (The latter would work if the XOR is done
manually instead of with crypto_xor, which could be a nice optimization
separately from fixing the endianness bug.)

> +	/* Block size must be >= 4 bytes. */
> +	err = -EINVAL;
> +	if (alg->cra_blocksize != XCTR_BLOCKSIZE)
> +		goto out_free_inst;

The comment above is outdated.

- Eric

Eric Biggers Feb. 17, 2022, 1:07 a.m. UTC | #2

On Thu, Feb 10, 2022 at 11:28:08PM +0000, Nathan Huckleberry wrote:
> Changes since v1:
>  * Rename streamcipher -> xctr
>  * Rename hash -> polyval
>  * Use __le64 instead of u64 for little-endian length
>  * memzero_explicit in set_key
>  * Use crypto request length instead of scatterlist length for polyval
>  * Add comments referencing the paper's pseudocode
>  * Derive blockcipher name from xctr name
>  * Pass IV through request context
>  * Use .generic_driver
>  * Make tests more comprehensive

This is looking much better than v1.  A few comments below.

> +static int hctr2_hash_tweak(struct skcipher_request *req)
> +{
> +	__le64 tweak_length_block[2];
> +	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +	const struct hctr2_tfm_ctx *tctx = crypto_skcipher_ctx(tfm);
> +	struct hctr2_request_ctx *rctx = skcipher_request_ctx(req);
> +	struct shash_desc *hash_desc = &rctx->u.hash_desc;
> +	int err;
> +
> +	memset(tweak_length_block, 0, sizeof(tweak_length_block));
> +	if (req->cryptlen % POLYVAL_BLOCK_SIZE == 0)
> +		tweak_length_block[0] = cpu_to_le64(TWEAK_SIZE * 8 * 2 + 2);
> +	else
> +		tweak_length_block[0] = cpu_to_le64(TWEAK_SIZE * 8 * 2 + 3);
> +
> +	hash_desc->tfm = tctx->polyval;
> +	err = crypto_shash_init(hash_desc);
> +	if (err)
> +		return err;
> +
> +	err = crypto_shash_update(hash_desc, (u8 *)tweak_length_block,
> +				  sizeof(tweak_length_block));
> +	if (err)
> +		return err;
> +	return crypto_shash_update(hash_desc, req->iv, TWEAK_SIZE);
> +}

Have you considered taking advantage of the hash function precomputation that is
possible?  That should improve the performance a bit, especially on short
inputs.  Per-key, a hash state could be precomputed containing the
tweak_length_block, since it will always have the same contents, due to this
implementation only supporting a single tweak length.  hctr2_setkey() could
compute that and export it using crypto_shash_export() into the hctr2_tfm_ctx.
hctr2_crypt() could import it using crypto_shash_import().

Similarly, the tweak only needs to be hashed once per request, as the state
could be exported after hashing it the first time, then imported for the second
hash step.  The saved state would need to be part of the hctr2_request_ctx.

> +static int hctr2_finish(struct skcipher_request *req)
> +{
> +	struct hctr2_request_ctx *rctx = skcipher_request_ctx(req);
> +	u8 digest[POLYVAL_DIGEST_SIZE];
> +	int err;
> +
> +	// U = UU ^ H(T || V)
> +	err = hctr2_hash_tweak(req);
> +	if (err)
> +		return err;
> +	err = hctr2_hash_message(req, rctx->bulk_part_dst, digest);
> +	if (err)
> +		return err;
> +	crypto_xor(rctx->first_block, digest, BLOCKCIPHER_BLOCK_SIZE);
> +
> +	// Copy U into dst scatterlist
> +	scatterwalk_map_and_copy(rctx->first_block, req->dst,
> +				 0, BLOCKCIPHER_BLOCK_SIZE, 1);
> +	return 0;
> +}

The comments here and in hctr2_crypt() are assuming encryption, but the code
also handles decryption.  It would be helpful to give the pseudocode for both,
e.g.:

	//    U = UU ^ H(T || V)
	// or M = MM ^ H(T || N)

> +static int hctr2_create_base(struct crypto_template *tmpl, struct rtattr **tb)
> +{
> +	const char *xctr_name;
> +	const char *polyval_name;
> +	char blockcipher_name[CRYPTO_MAX_ALG_NAME];
> +	int len;
> +
> +	xctr_name = crypto_attr_alg_name(tb[1]);
> +	if (IS_ERR(xctr_name))
> +		return PTR_ERR(xctr_name);
> +
> +	if (!strncmp(xctr_name, "xctr(", 5)) {
> +		len = strscpy(blockcipher_name, xctr_name + 5,
> +			    sizeof(blockcipher_name));
> +
> +		if (len < 1)
> +			return -EINVAL;
> +
> +		if (blockcipher_name[len - 1] != ')')
> +			return -EINVAL;
> +
> +		blockcipher_name[len - 1] = 0;
> +	} else
> +		return -EINVAL;

I don't think this is exactly what Herbert and I had in mind.  It's close, but
the problem with grabbing the block cipher name from the raw string the user
passes in is that the string could be an implementation name like "xctr-aes-ni",
rather than an algorithm name like "xctr(aes)".

The correct way to do this is to wait to determine the block cipher algorithm
until after calling crypto_grab_skcipher().  Then it will be available in the
->cra_name of the skcipher algorithm (that should be xctr).

> diff --git a/crypto/testmgr.h b/crypto/testmgr.h
> index da3736e51982..a16b631730e9 100644
> --- a/crypto/testmgr.h
> +++ b/crypto/testmgr.h
> @@ -33630,4 +33630,674 @@ static const struct hash_testvec polyval_tv_template[] = {
>  	},
>  };
>  
> +/*
> + * Test vectors generated using https://github.com/google/hctr2
> + */
> +static const struct cipher_testvec aes_hctr2_tv_template[] = {
> +		.klen	= 16,
> +		.len	= 16,
> +	},
> +	{
[...]
> +		.klen	= 16,
> +		.len	= 16,
> +	},
> +	{
[...]
> +		.klen	= 16,
> +		.len	= 17,
> +	},
> +	{
[...]
> +		.klen	= 16,
> +		.len	= 17,
> +	},
> +	{
[...]
> +		.klen	= 24,
> +		.len	= 31,
> +	},
> +	{
> +		.klen	= 24,
> +		.len	= 31,
> +	},
[...]
> +	{
> +		.klen	= 24,
> +		.len	= 48,
> +	},
[...]
> +		.klen	= 24,
> +		.len	= 48,
[...]
> +		.klen	= 32,
> +		.len	= 128,
[...]
> +		.klen	= 32,
> +		.len	= 128,
[...]
> +		.klen	= 32,
> +		.len	= 255,
[...]
> +		.klen	= 32,
> +		.len	= 255,
[...]
> +		.klen	= 32,
> +		.len	= 512,
[...]
> +		.klen	= 32,
> +		.len	= 512,

These are better but still could use some improvement.  There are two test
vectors for each message length, but they also have the same key length.  That
provides little additional test coverage over having just half the test vectors.
It would be better to use different key lengths within each pair.  Maybe always
use klen=32 for one test vector, and randomly choose klen=16 or klen=24 for the
other test vector.  The overall point is: we can't test a zillion test vectors,
so we should make sure to get as most test coverage as possible with the smaller
set that will actually be included.

- Eric