From patchwork Fri Oct 9 16:16:05 2015
X-Patchwork-Submitter: Christophe Lyon
X-Patchwork-Id: 54712
Date: Fri, 9 Oct 2015 18:16:05 +0200
Subject: Re: [AArch64_be] Fix vtbl[34] and vtbx4
From: Christophe Lyon <christophe.lyon@linaro.org>
To: James Greenhalgh
Cc: "gcc-patches@gcc.gnu.org"
In-Reply-To: <20151008091230.GA13098@arm.com>
References: <20151007150941.GA31205@arm.com> <20151008091230.GA13098@arm.com>

On 8 October 2015 at 11:12, James Greenhalgh wrote:
> On Wed, Oct 07, 2015 at 09:07:30PM +0100, Christophe Lyon wrote:
>> On 7 October 2015 at 17:09, James Greenhalgh wrote:
>> > On Tue, Sep 15, 2015 at 05:25:25PM +0100, Christophe Lyon wrote:
>> >
>> > Why do we want this for vtbx4 rather than putting out a VTBX instruction
>> > directly (as in the inline asm versions you replace)?
>> >
>> I just followed the pattern used for vtbx3.
>>
>> > This sequence does make sense for vtbx3.
>> In fact, I don't see why vtbx3 and vtbx4 should be different?
>
> The difference between TBL and TBX is in their handling of a request to
> select an out-of-range value. For TBL this returns zero, for TBX this
> returns the value which was already in the destination register.
>
> Because the byte-vectors used by the TBX instruction in aarch64 are 128-bit
> (so two of them together allow selecting elements in the range 0-31), and
> vtbx3 needs to emulate the AArch32 behaviour of picking elements from 3x64-bit
> vectors (allowing elements in the range 0-23), we need to manually check for
> values which would have been out-of-range on AArch32, but are not out
> of range for AArch64, and handle them appropriately. For vtbx4 on the other
> hand, 2x128-bit registers give the range 0..31 and 4x64-bit registers also give
> the range 0..31, so we don't need the special masked handling.
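As a side note for readers of the thread: a scalar C model of the TBL/TBX
semantics described above could look like the sketch below. This is purely
illustrative and is not part of the patch; 'nbytes' stands for the size of
the byte table (16 or 32 for the AArch64 forms, 8 to 32 in steps of 8 for
the AArch32 ones).

  /* Illustrative scalar model only, not the GCC implementation.  */
  static inline unsigned char
  tbl_byte (const unsigned char *table, unsigned nbytes, unsigned char idx)
  {
    /* TBL: an out-of-range index selects zero.  */
    return idx < nbytes ? table[idx] : 0;
  }

  static inline unsigned char
  tbx_byte (const unsigned char *table, unsigned nbytes, unsigned char idx,
            unsigned char dest)
  {
    /* TBX: an out-of-range index keeps the byte already in the destination.  */
    return idx < nbytes ? table[idx] : dest;
  }

In these terms, vtbx4 has a 32-byte AArch32 table (4x64-bit) and a 32-byte
AArch64 table (2x128-bit), so the in-range sets coincide and TBX applies
directly; vtbx3 has only a 24-byte AArch32 table against a 32-byte AArch64
one, which is why indices 24..31 need the extra mask-and-select step.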
>
> You can find the suggested instruction sequences for the Neon intrinsics
> in this document:
>
> http://infocenter.arm.com/help/topic/com.arm.doc.ihi0073a/IHI0073A_arm_neon_intrinsics_ref.pdf
>

Hi James,

Please find attached an updated version which hopefully addresses your
comments.

Tested on aarch64-none-elf and aarch64_be-none-elf using the Foundation
Model.

OK?

Christophe.

>> >> /* vtrn */
>> >>
>> >> __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
>> >> diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
>> >> index b8a45d1..dfbd9cd 100644
>> >> --- a/gcc/config/aarch64/iterators.md
>> >> +++ b/gcc/config/aarch64/iterators.md
>> >> @@ -100,6 +100,8 @@
>> >>  ;; All modes.
>> >>  (define_mode_iterator VALL [V8QI V16QI V4HI V8HI V2SI V4SI V2DI V2SF V4SF V2DF])
>> >>
>> >> +(define_mode_iterator V8Q [V8QI])
>> >> +
>> >
>> > This can be dropped if you use VAR1 in aarch64-builtins.c.
>> >
>> > Thanks for working on this, with your patch applied, the only
>> > remaining intrinsics I see failing for aarch64_be are:
>> >
>> >   vqtbl2_*8
>> >   vqtbl2q_*8
>> >   vqtbl3_*8
>> >   vqtbl3q_*8
>> >   vqtbl4_*8
>> >   vqtbl4q_*8
>> >
>> >   vqtbx2_*8
>> >   vqtbx2q_*8
>> >   vqtbx3_*8
>> >   vqtbx3q_*8
>> >   vqtbx4_*8
>> >   vqtbx4q_*8
>> >
>> Quite possibly. Which tests are you looking at? Since these are
>> aarch64-specific, they are not part of the
>> tests I added (advsimd-intrinsics). Do you mean
>> gcc.target/aarch64/table-intrinsics.c?
>
> Sorry, yes I should have given a reference. I'm running with a variant of
> a testcase from the LLVM test-suite repository:
>
>   SingleSource/UnitTests/Vector/AArch64/aarch64_neon_intrinsics.c
>
> This has an execute test for most of the intrinsics specified for AArch64.
> It needs some modification to cover the intrinsics we don't implement yet.
>
> Thanks,
> James
>

2015-10-09  Christophe Lyon  <christophe.lyon@linaro.org>

        * config/aarch64/aarch64-simd-builtins.def: Update builtins
        tables: add tbl3 and tbx4.
        * config/aarch64/aarch64-simd.md (aarch64_tbl3v8qi): New.
        (aarch64_tbx4v8qi): New.
        * config/aarch64/arm_neon.h (vtbl3_s8, vtbl3_u8, vtbl3_p8)
        (vtbl4_s8, vtbl4_u8, vtbl4_p8, vtbx4_s8, vtbx4_u8, vtbx4_p8):
        Rewrite using builtin functions.
        * config/aarch64/iterators.md (UNSPEC_TBX): New.

diff --git a/gcc/config/aarch64/aarch64-simd-builtins.def b/gcc/config/aarch64/aarch64-simd-builtins.def
index d0f298a..c16e82c9 100644
--- a/gcc/config/aarch64/aarch64-simd-builtins.def
+++ b/gcc/config/aarch64/aarch64-simd-builtins.def
@@ -405,3 +405,8 @@
   VAR1 (BINOPP, crypto_pmull, 0, di)
   VAR1 (BINOPP, crypto_pmull, 0, v2di)
 
+  /* Implemented by aarch64_tbl3v8qi.  */
+  VAR1 (BINOP, tbl3, 0, v8qi)
+
+  /* Implemented by aarch64_tbx4v8qi.  */
+  VAR1 (TERNOP, tbx4, 0, v8qi)
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9777418..6027582 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4716,6 +4714,27 @@
   [(set_attr "type" "neon_tbl2_q")]
 )
 
+(define_insn "aarch64_tbl3v8qi"
+  [(set (match_operand:V8QI 0 "register_operand" "=w")
+        (unspec:V8QI [(match_operand:OI 1 "register_operand" "w")
+                      (match_operand:V8QI 2 "register_operand" "w")]
+                      UNSPEC_TBL))]
+  "TARGET_SIMD"
+  "tbl\\t%S0.8b, {%S1.16b - %T1.16b}, %S2.8b"
+  [(set_attr "type" "neon_tbl3")]
+)
+
+(define_insn "aarch64_tbx4v8qi"
+  [(set (match_operand:V8QI 0 "register_operand" "=w")
+        (unspec:V8QI [(match_operand:V8QI 1 "register_operand" "0")
+                      (match_operand:OI 2 "register_operand" "w")
+                      (match_operand:V8QI 3 "register_operand" "w")]
+                      UNSPEC_TBX))]
+  "TARGET_SIMD"
+  "tbx\\t%S0.8b, {%S2.16b - %T2.16b}, %S3.8b"
+  [(set_attr "type" "neon_tbl4")]
+)
+
 (define_insn_and_split "aarch64_combinev16qi"
   [(set (match_operand:OI 0 "register_operand" "=w")
         (unspec:OI [(match_operand:V16QI 1 "register_operand" "w")
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 6dfebe7..e99819e 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -10902,13 +10902,14 @@ vtbl3_s8 (int8x8x3_t tab, int8x8_t idx)
 {
   int8x8_t result;
   int8x16x2_t temp;
+  __builtin_aarch64_simd_oi __o;
   temp.val[0] = vcombine_s8 (tab.val[0], tab.val[1]);
   temp.val[1] = vcombine_s8 (tab.val[2], vcreate_s8 (__AARCH64_UINT64_C (0x0)));
-  __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t"
-           "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t"
-           : "=w"(result)
-           : "Q"(temp), "w"(idx)
-           : "v16", "v17", "memory");
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[0], 0);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[1], 1);
+  result = __builtin_aarch64_tbl3v8qi (__o, idx);
   return result;
 }
 
@@ -10917,13 +10918,14 @@ vtbl3_u8 (uint8x8x3_t tab, uint8x8_t idx)
 {
   uint8x8_t result;
   uint8x16x2_t temp;
+  __builtin_aarch64_simd_oi __o;
   temp.val[0] = vcombine_u8 (tab.val[0], tab.val[1]);
   temp.val[1] = vcombine_u8 (tab.val[2], vcreate_u8 (__AARCH64_UINT64_C (0x0)));
-  __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t"
-           "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t"
-           : "=w"(result)
-           : "Q"(temp), "w"(idx)
-           : "v16", "v17", "memory");
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[0], 0);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[1], 1);
+  result = (uint8x8_t)__builtin_aarch64_tbl3v8qi (__o, (int8x8_t)idx);
   return result;
 }
 
@@ -10932,13 +10934,14 @@ vtbl3_p8 (poly8x8x3_t tab, uint8x8_t idx)
 {
   poly8x8_t result;
   poly8x16x2_t temp;
+  __builtin_aarch64_simd_oi __o;
   temp.val[0] = vcombine_p8 (tab.val[0], tab.val[1]);
   temp.val[1] = vcombine_p8 (tab.val[2], vcreate_p8 (__AARCH64_UINT64_C (0x0)));
-  __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t"
-           "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t"
-           : "=w"(result)
-           : "Q"(temp), "w"(idx)
-           : "v16", "v17", "memory");
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[0], 0);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[1], 1);
+  result = (poly8x8_t)__builtin_aarch64_tbl3v8qi (__o, (int8x8_t)idx);
   return result;
 }
 
@@ -10947,13 +10950,14 @@ vtbl4_s8 (int8x8x4_t tab, int8x8_t idx)
 {
   int8x8_t result;
   int8x16x2_t temp;
+  __builtin_aarch64_simd_oi __o;
   temp.val[0] = vcombine_s8 (tab.val[0], tab.val[1]);
   temp.val[1] = vcombine_s8 (tab.val[2], tab.val[3]);
__asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t" - "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "=w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[0], 0); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[1], 1); + result = __builtin_aarch64_tbl3v8qi (__o, idx); return result; } @@ -10962,13 +10966,14 @@ vtbl4_u8 (uint8x8x4_t tab, uint8x8_t idx) { uint8x8_t result; uint8x16x2_t temp; + __builtin_aarch64_simd_oi __o; temp.val[0] = vcombine_u8 (tab.val[0], tab.val[1]); temp.val[1] = vcombine_u8 (tab.val[2], tab.val[3]); - __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t" - "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "=w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[0], 0); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[1], 1); + result = (uint8x8_t)__builtin_aarch64_tbl3v8qi (__o, (int8x8_t)idx); return result; } @@ -10977,13 +10982,14 @@ vtbl4_p8 (poly8x8x4_t tab, uint8x8_t idx) { poly8x8_t result; poly8x16x2_t temp; + __builtin_aarch64_simd_oi __o; temp.val[0] = vcombine_p8 (tab.val[0], tab.val[1]); temp.val[1] = vcombine_p8 (tab.val[2], tab.val[3]); - __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t" - "tbl %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "=w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[0], 0); + __o = __builtin_aarch64_set_qregoiv16qi (__o, + (int8x16_t) temp.val[1], 1); + result = (poly8x8_t)__builtin_aarch64_tbl3v8qi (__o, (int8x8_t)idx); return result; } @@ -11023,51 +11029,6 @@ vtbx2_p8 (poly8x8_t r, poly8x8x2_t tab, uint8x8_t idx) return result; } -__extension__ static __inline int8x8_t __attribute__ ((__always_inline__)) -vtbx4_s8 (int8x8_t r, int8x8x4_t tab, int8x8_t idx) -{ - int8x8_t result = r; - int8x16x2_t temp; - temp.val[0] = vcombine_s8 (tab.val[0], tab.val[1]); - temp.val[1] = vcombine_s8 (tab.val[2], tab.val[3]); - __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t" - "tbx %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "+w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); - return result; -} - -__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__)) -vtbx4_u8 (uint8x8_t r, uint8x8x4_t tab, uint8x8_t idx) -{ - uint8x8_t result = r; - uint8x16x2_t temp; - temp.val[0] = vcombine_u8 (tab.val[0], tab.val[1]); - temp.val[1] = vcombine_u8 (tab.val[2], tab.val[3]); - __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t" - "tbx %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "+w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); - return result; -} - -__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__)) -vtbx4_p8 (poly8x8_t r, poly8x8x4_t tab, uint8x8_t idx) -{ - poly8x8_t result = r; - poly8x16x2_t temp; - temp.val[0] = vcombine_p8 (tab.val[0], tab.val[1]); - temp.val[1] = vcombine_p8 (tab.val[2], tab.val[3]); - __asm__ ("ld1 {v16.16b - v17.16b }, %1\n\t" - "tbx %0.8b, {v16.16b - v17.16b}, %2.8b\n\t" - : "+w"(result) - : "Q"(temp), "w"(idx) - : "v16", "v17", "memory"); - return result; -} - /* End of temporary inline asm. */ /* Start of optimal implementations in approved order. 
@@ -23221,6 +23182,58 @@ vtbx3_p8 (poly8x8_t __r, poly8x8x3_t __tab, uint8x8_t __idx)
   return vbsl_p8 (__mask, __tbl, __r);
 }
 
+/* vtbx4 */
+
+__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
+vtbx4_s8 (int8x8_t __r, int8x8x4_t __tab, int8x8_t __idx)
+{
+  int8x8_t result;
+  int8x16x2_t temp;
+  __builtin_aarch64_simd_oi __o;
+  temp.val[0] = vcombine_s8 (__tab.val[0], __tab.val[1]);
+  temp.val[1] = vcombine_s8 (__tab.val[2], __tab.val[3]);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[0], 0);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[1], 1);
+  result = __builtin_aarch64_tbx4v8qi (__r, __o, __idx);
+  return result;
+}
+
+__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
+vtbx4_u8 (uint8x8_t __r, uint8x8x4_t __tab, uint8x8_t __idx)
+{
+  uint8x8_t result;
+  uint8x16x2_t temp;
+  __builtin_aarch64_simd_oi __o;
+  temp.val[0] = vcombine_u8 (__tab.val[0], __tab.val[1]);
+  temp.val[1] = vcombine_u8 (__tab.val[2], __tab.val[3]);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[0], 0);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[1], 1);
+  result = (uint8x8_t)__builtin_aarch64_tbx4v8qi ((int8x8_t)__r, __o,
+                                                  (int8x8_t)__idx);
+  return result;
+}
+
+__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
+vtbx4_p8 (poly8x8_t __r, poly8x8x4_t __tab, uint8x8_t __idx)
+{
+  poly8x8_t result;
+  poly8x16x2_t temp;
+  __builtin_aarch64_simd_oi __o;
+  temp.val[0] = vcombine_p8 (__tab.val[0], __tab.val[1]);
+  temp.val[1] = vcombine_p8 (__tab.val[2], __tab.val[3]);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[0], 0);
+  __o = __builtin_aarch64_set_qregoiv16qi (__o,
+                                           (int8x16_t) temp.val[1], 1);
+  result = (poly8x8_t)__builtin_aarch64_tbx4v8qi ((int8x8_t)__r, __o,
+                                                  (int8x8_t)__idx);
+  return result;
+}
+
 /* vtrn */
 
 __extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index b8a45d1..d856117 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -253,6 +253,7 @@
     UNSPEC_USHLL        ; Used in aarch64-simd.md.
     UNSPEC_ADDP         ; Used in aarch64-simd.md.
     UNSPEC_TBL          ; Used in vector permute patterns.
+    UNSPEC_TBX          ; Used in vector permute patterns.
    UNSPEC_CONCAT       ; Used in vector permute patterns.
     UNSPEC_ZIP1         ; Used in vector permute patterns.
     UNSPEC_ZIP2         ; Used in vector permute patterns.
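For completeness, here is a small self-contained usage sketch for the
reworked vtbx4_s8 (a hypothetical example, not part of the patch or of the
testsuite changes): in-range indices select bytes from the 32-byte table
formed by the four input vectors, while out-of-range indices keep the
corresponding byte of the first argument.

  #include <arm_neon.h>
  #include <stdio.h>

  int
  main (void)
  {
    int8_t bytes[32];
    for (int i = 0; i < 32; i++)
      bytes[i] = i;

    /* Build the 4x64-bit lookup table holding values 0..31.  */
    int8x8x4_t tab;
    tab.val[0] = vld1_s8 (bytes);
    tab.val[1] = vld1_s8 (bytes + 8);
    tab.val[2] = vld1_s8 (bytes + 16);
    tab.val[3] = vld1_s8 (bytes + 24);

    int8_t r_init[8]  = { -1, -1, -1, -1, -1, -1, -1, -1 };
    int8_t indices[8] = { 0, 7, 15, 23, 31, 32, 40, 100 };
    int8x8_t r   = vld1_s8 (r_init);
    int8x8_t idx = vld1_s8 (indices);

    int8_t out[8];
    vst1_s8 (out, vtbx4_s8 (r, tab, idx));

    /* Expected output: 0 7 15 23 31 -1 -1 -1, since indices >= 32 keep
       the corresponding byte of 'r'.  */
    for (int i = 0; i < 8; i++)
      printf ("%d ", out[i]);
    printf ("\n");
    return 0;
  }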