From patchwork Tue Apr 30 19:42:48 2024
X-Patchwork-Submitter: Richard Henderson
X-Patchwork-Id: 793368
From: Richard Henderson
To: qemu-devel@nongnu.org
Cc: philmd@linaro.org, Alexander Monakov, Mikhail Romanov
Subject: [PATCH v7 05/10] util/bufferiszero: Optimize SSE2 and AVX2 variants
Date: Tue, 30 Apr 2024 12:42:48 -0700
Message-Id: <20240430194253.904768-6-richard.henderson@linaro.org>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20240430194253.904768-1-richard.henderson@linaro.org>
References: <20240430194253.904768-1-richard.henderson@linaro.org>

From: Alexander Monakov

Increase unroll factor in SIMD loops from 4x to 8x in order to move their
bottlenecks from ALU port contention to load issue rate (two loads per
cycle on popular x86 implementations).

Avoid using out-of-bounds pointers in loop boundary conditions.

Follow SSE2 implementation strategy in the AVX2 variant. Avoid use of
PTEST, which is not profitable there (like in the removed SSE4 variant).

Signed-off-by: Alexander Monakov
Signed-off-by: Mikhail Romanov
Reviewed-by: Richard Henderson
Message-Id: <20240206204809.9859-6-amonakov@ispras.ru>
---
 util/bufferiszero.c | 111 +++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 38 deletions(-)

diff --git a/util/bufferiszero.c b/util/bufferiszero.c
index 00118d649e..02df82b4ff 100644
--- a/util/bufferiszero.c
+++ b/util/bufferiszero.c
@@ -67,62 +67,97 @@ static bool buffer_is_zero_integer(const void *buf, size_t len)
 
 #if defined(CONFIG_AVX2_OPT) || defined(__SSE2__)
 #include <immintrin.h>
 
-/* Note that each of these vectorized functions require len >= 64. */
+/* Helper for preventing the compiler from reassociating
+   chains of binary vector operations. */
+#define SSE_REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))
+
+/* Note that these vectorized functions may assume len >= 256. */
 
 static bool __attribute__((target("sse2")))
 buffer_zero_sse2(const void *buf, size_t len)
 {
-    __m128i t = _mm_loadu_si128(buf);
-    __m128i *p = (__m128i *)(((uintptr_t)buf + 5 * 16) & -16);
-    __m128i *e = (__m128i *)(((uintptr_t)buf + len) & -16);
-    __m128i zero = _mm_setzero_si128();
+    /* Unaligned loads at head/tail. */
+    __m128i v = *(__m128i_u *)(buf);
+    __m128i w = *(__m128i_u *)(buf + len - 16);
+    /* Align head/tail to 16-byte boundaries. */
+    const __m128i *p = QEMU_ALIGN_PTR_DOWN(buf + 16, 16);
+    const __m128i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 16);
+    __m128i zero = { 0 };
 
-    /* Loop over 16-byte aligned blocks of 64. */
-    while (likely(p <= e)) {
-        t = _mm_cmpeq_epi8(t, zero);
-        if (unlikely(_mm_movemask_epi8(t) != 0xFFFF)) {
+    /* Collect a partial block at tail end. */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /*
+     * Loop over complete 128-byte blocks.
+     * With the head and tail removed, e - p >= 14, so the loop
+     * must iterate at least once.
+     */
+    do {
+        v = _mm_cmpeq_epi8(v, zero);
+        if (unlikely(_mm_movemask_epi8(v) != 0xFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    }
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+        p += 8;
+    } while (p < e - 7);
 
-    /* Finish the aligned tail. */
-    t |= e[-3];
-    t |= e[-2];
-    t |= e[-1];
-
-    /* Finish the unaligned tail. */
-    t |= _mm_loadu_si128(buf + len - 16);
-
-    return _mm_movemask_epi8(_mm_cmpeq_epi8(t, zero)) == 0xFFFF;
+    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, zero)) == 0xFFFF;
 }
 
 #ifdef CONFIG_AVX2_OPT
 static bool __attribute__((target("avx2")))
 buffer_zero_avx2(const void *buf, size_t len)
 {
-    /* Begin with an unaligned head of 32 bytes. */
-    __m256i t = _mm256_loadu_si256(buf);
-    __m256i *p = (__m256i *)(((uintptr_t)buf + 5 * 32) & -32);
-    __m256i *e = (__m256i *)(((uintptr_t)buf + len) & -32);
+    /* Unaligned loads at head/tail. */
+    __m256i v = *(__m256i_u *)(buf);
+    __m256i w = *(__m256i_u *)(buf + len - 32);
+    /* Align head/tail to 32-byte boundaries. */
+    const __m256i *p = QEMU_ALIGN_PTR_DOWN(buf + 32, 32);
+    const __m256i *e = QEMU_ALIGN_PTR_DOWN(buf + len - 1, 32);
+    __m256i zero = { 0 };
 
-    /* Loop over 32-byte aligned blocks of 128. */
-    while (p <= e) {
-        if (unlikely(!_mm256_testz_si256(t, t))) {
+    /* Collect a partial block at tail end. */
+    v |= e[-1]; w |= e[-2];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-3]; w |= e[-4];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-5]; w |= e[-6];
+    SSE_REASSOC_BARRIER(v, w);
+    v |= e[-7]; v |= w;
+
+    /* Loop over complete 256-byte blocks. */
+    for (; p < e - 7; p += 8) {
+        /* PTEST is not profitable here. */
+        v = _mm256_cmpeq_epi8(v, zero);
+        if (unlikely(_mm256_movemask_epi8(v) != 0xFFFFFFFF)) {
             return false;
         }
-        t = p[-4] | p[-3] | p[-2] | p[-1];
-        p += 4;
-    } ;
+        v = p[0]; w = p[1];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[2]; w |= p[3];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[4]; w |= p[5];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= p[6]; w |= p[7];
+        SSE_REASSOC_BARRIER(v, w);
+        v |= w;
+    }
 
-    /* Finish the last block of 128 unaligned. */
-    t |= _mm256_loadu_si256(buf + len - 4 * 32);
-    t |= _mm256_loadu_si256(buf + len - 3 * 32);
-    t |= _mm256_loadu_si256(buf + len - 2 * 32);
-    t |= _mm256_loadu_si256(buf + len - 1 * 32);
-
-    return _mm256_testz_si256(t, t);
+    return _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero)) == 0xFFFFFFFF;
 }
 
 #endif /* CONFIG_AVX2_OPT */
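
For readers following the commit message rather than the diff, below is a minimal
standalone sketch (not part of the patch) of the technique it describes: two
independent OR accumulators, an 8x unroll, and an empty asm barrier that keeps the
compiler from reassociating the two chains back into one long dependency chain. It
assumes GNU C (vector operators and asm) on x86-64, where SSE2 is baseline; the
function and macro names here are hypothetical, not QEMU API.

/*
 * Illustrative only; build with e.g. "gcc -O2 sketch.c".
 */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdbool.h>
#include <stdio.h>

/* Same idea as SSE_REASSOC_BARRIER in the patch: tie vec0/vec1 to
   registers via an empty asm so the optimizer treats them as opaque. */
#define REASSOC_BARRIER(vec0, vec1) asm("" : "+x"(vec0), "+x"(vec1))

/* Check one 16-byte-aligned, 128-byte block for all-zero bytes. */
static bool block128_is_zero(const __m128i *p)
{
    __m128i v = p[0], w = p[1];

    REASSOC_BARRIER(v, w);
    v |= p[2]; w |= p[3];            /* GNU C vector |=, as in the patch */
    REASSOC_BARRIER(v, w);
    v |= p[4]; w |= p[5];
    REASSOC_BARRIER(v, w);
    v |= p[6]; w |= p[7];
    REASSOC_BARRIER(v, w);
    v |= w;

    /* All 16 bytes equal zero iff the comparison mask is 0xFFFF. */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128())) == 0xFFFF;
}

int main(void)
{
    /* _Alignas guarantees the 16-byte alignment the aligned loads rely on. */
    _Alignas(16) unsigned char buf[128] = { 0 };

    printf("zero block: %d\n", block128_is_zero((const __m128i *)buf));
    buf[127] = 1;
    printf("dirty block: %d\n", block128_is_zero((const __m128i *)buf));
    return 0;
}

With the two accumulators the eight loads per iteration feed two short OR chains
instead of one serial chain, which is what lets the loop approach the two-loads-per-cycle
limit mentioned in the commit message.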