From patchwork Thu Oct 15 09:40:27 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
X-Patchwork-Id: 54992
Return-Path: <patchwork-forward+bncBDFONDVM3EGBBI7J7WYAKGQENXXC23A@linaro.org>
X-Original-To: linaro@patches.linaro.org
Delivered-To: linaro@patches.linaro.org
Received: from mail-wi0-f200.google.com (mail-wi0-f200.google.com
 [209.85.212.200])
 by patches.linaro.org (Postfix) with ESMTPS id 6AA4A2301F
 for <linaro@patches.linaro.org>; Thu, 15 Oct 2015 09:40:52 +0000 (UTC)
Received: by wicgb1 with SMTP id gb1sf7174148wic.3
 for <linaro@patches.linaro.org>; Thu, 15 Oct 2015 02:40:51 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:delivered-to:mailing-list:precedence:list-id
 :list-unsubscribe:list-archive:list-post:list-help:sender
 :delivered-to:message-id:date:from:user-agent:mime-version:to:cc
 :subject:content-type:x-original-sender
 :x-original-authentication-results;
 bh=n3DL7ZLBvX5QsSRMIVwa/zYCkq3+1SpYdkXTD58FmbM=;
 b=S8ANTNeRBAkNPxKTzY1Sydi/Rn9+jhY6OXzh3Ejh1SpL+59sqC++3pXls3mctX/uE6
 mpnDGyPmtBGwJj7t3z5Ax30RuIOQtin1G82t2ToUmRzy9VfeDNWdUeTZcorx2nX0/AzV
 nefIWwgLaqeroMqLmRtqQvPHUumMxpzKNfiuGqYga0hv6xdN9Yp18BFC4mHj5HAx9nhh
 gTRticRSa4E9FZDxGAPx5228dLyJ1CURp9YvFJQWr9wGicgqlZRunphxSYJ305QZ4qDU
 s933NBpkbEAns1mgrbcIVS+x9UhMDxkOFPFbP5dhGyDcL1EgPT/oyGUhZR2fwmgXOvbz
 gufg==
X-Gm-Message-State: ALoCoQkMWUGEfmGZKcD52PzraY/5V4EdlPfylZ6pDwGNk/XRYoISlxp4hpd5tDgAbmEWD3YUZEOW
X-Received: by 10.180.219.66 with SMTP id pm2mr431463wic.1.1444902051685;
 Thu, 15 Oct 2015 02:40:51 -0700 (PDT)
X-BeenThere: patchwork-forward@linaro.org
Received: by 10.25.78.81 with SMTP id c78ls130111lfb.45.gmail; Thu, 15 Oct
 2015 02:40:51 -0700 (PDT)
X-Received: by 10.25.206.199 with SMTP id e190mr2580856lfg.39.1444902051523; 
 Thu, 15 Oct 2015 02:40:51 -0700 (PDT)
Received: from mail-lb0-x22a.google.com (mail-lb0-x22a.google.com.
 [2a00:1450:4010:c04::22a]) by mx.google.com with ESMTPS id
 p12si4470024lfe.49.2015.10.15.02.40.51
 for <patchwork-forward@linaro.org>
 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Thu, 15 Oct 2015 02:40:51 -0700 (PDT)
Received-SPF: pass (google.com: domain of
 patch+caf_=patchwork-forward=linaro.org@linaro.org designates
 2a00:1450:4010:c04::22a as permitted sender)
 client-ip=2a00:1450:4010:c04::22a; 
Received: by lbwr8 with SMTP id r8so64550843lbw.2
 for <patchwork-forward@linaro.org>;
 Thu, 15 Oct 2015 02:40:51 -0700 (PDT)
X-Received: by 10.112.17.34 with SMTP id l2mr3933363lbd.117.1444902051298;
 Thu, 15 Oct 2015 02:40:51 -0700 (PDT)
X-Forwarded-To: patchwork-forward@linaro.org
X-Forwarded-For: patch@linaro.org patchwork-forward@linaro.org
Delivered-To: patch@linaro.org
Received: by 10.112.59.35 with SMTP id w3csp511614lbq;
 Thu, 15 Oct 2015 02:40:50 -0700 (PDT)
X-Received: by 10.107.132.217 with SMTP id o86mr7639021ioi.172.1444902050023; 
 Thu, 15 Oct 2015 02:40:50 -0700 (PDT)
Received: from sourceware.org (server1.sourceware.org. [209.132.180.131])
 by mx.google.com with ESMTPS id
 c3si10857403ioe.17.2015.10.15.02.40.49 for <patch@linaro.org>
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Thu, 15 Oct 2015 02:40:50 -0700 (PDT)
Received-SPF: pass (google.com: domain of
 gcc-patches-return-410235-patch=linaro.org@gcc.gnu.org
 designates 209.132.180.131 as permitted sender)
 client-ip=209.132.180.131; 
Received: (qmail 67169 invoked by alias); 15 Oct 2015 09:40:36 -0000
Mailing-List: list patchwork-forward@linaro.org;
 contact patchwork-forward+owners@linaro.org
Precedence: list
List-Id: <patchwork-forward.linaro.org>
List-Unsubscribe: <mailto:googlegroups-manage+836684582541+unsubscribe@googlegroups.com>, 
 <http://groups.google.com/a/linaro.org/group/patchwork-forward/subscribe>
List-Archive: <http://groups.google.com/a/linaro.org/group/patchwork-forward/>
List-Post: <http://groups.google.com/a/linaro.org/group/patchwork-forward/post>, 
 <mailto:patchwork-forward@linaro.org>
List-Help: <http://support.google.com/a/linaro.org/bin/topic.py?topic=25838>, 
 <mailto:patchwork-forward+help@linaro.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 67158 invoked by uid 89); 15 Oct 2015 09:40:35 -0000
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.7 required=5.0 tests=AWL, BAYES_00,
 SPF_PASS autolearn=ham version=3.3.2
X-HELO: eu-smtp-delivery-143.mimecast.com
Received: from eu-smtp-delivery-143.mimecast.com (HELO
 eu-smtp-delivery-143.mimecast.com) (146.101.78.143) by
 sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
 Thu, 15 Oct 2015 09:40:33 +0000
Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com
 [217.140.96.140]) by eu-smtp-1.mimecast.com with ESMTP id
 uk-mta-15-qnSnE7MhTCOg0Jt9FsDwIQ-1; Thu, 15 Oct 2015 10:40:27 +0100
Received: from [10.2.207.50] ([10.1.2.79]) by cam-owa1.Emea.Arm.com with
 Microsoft SMTPSVC(6.0.3790.3959); Thu, 15 Oct 2015 10:40:27 +0100
Message-ID: <561F748B.5090705@arm.com>
Date: Thu, 15 Oct 2015 10:40:27 +0100
From: Kyrill Tkachov <kyrylo.tkachov@arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: GCC Patches <gcc-patches@gcc.gnu.org>
CC: Maxim Kuvyrkov <maxim.kuvyrkov@linaro.org>
Subject: [PATCH][haifa-sched] model load/store multiples properly in
 autoprefetcher scheduling
X-MC-Unique: qnSnE7MhTCOg0Jt9FsDwIQ-1
X-IsSubscribed: yes
X-Original-Sender: kyrylo.tkachov@arm.com
X-Original-Authentication-Results: mx.google.com; spf=pass (google.com:
 domain of
 patch+caf_=patchwork-forward=linaro.org@linaro.org designates
 2a00:1450:4010:c04::22a as permitted sender)
 smtp.mailfrom=patch+caf_=patchwork-forward=linaro.org@linaro.org;
 dkim=pass header.i=@gcc.gnu.org
X-Google-Group-Id: 836684582541

Hi all,

I'd like to turn on the scheduling-for-autoprefetching heuristic from haifa-sched for aarch64.
This will have the effect of sorting sequences of loads and stores in ascending order of offsets from a common base.
However, there is a limitation with the current code that I'd like to remove first.

The code that analyzes the offsets of the loads/stores doesn't try to handle load/store-multiple insns.
These appear rather frequently in memory streaming workloads on aarch64 in the form of load-pair/store-pair instructions
(ldp/stp).  In RTL, they are created by the sched_fusion pass plus a subsequent peephole, and during sched2 they appear
as PARALLEL rtxes of multiple SETs to/from memory.

This patch teaches autopref_multipass_init to handle these kinds of PARALLELs as long as the SETs within them are all loads or all stores
and all use the same base register.  We now record the minimum and maximum offsets and use those when ranking pairs of insns.
The exact ranking logic is described in the new function autopref_rank_data.
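
For illustration only, the ranking can be modelled outside GCC as an interval comparison.  The sketch below is not the
patch code: struct mem_range, effective_range and rank_ranges are hypothetical names, and it assumes that a single memory
op can be treated as the degenerate range [offset, offset].  Under that assumption it appears to give the same results
as the four cases spelled out in autopref_rank_data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct autopref_multipass_data_, reduced to
   the fields the ranking needs.  */
struct mem_range
{
  int min_offset;
  int max_offset;	/* Only meaningful when multi_p is true.  */
  bool multi_p;		/* Load/store-multiple (e.g. ldp/stp)?  */
};

/* Compute the offset range covered by D.  A single memory op is treated
   as the degenerate range [min_offset, min_offset] (an assumption of this
   sketch, not something the patch states).  */
static void
effective_range (const struct mem_range *d, int *lo, int *hi)
{
  *lo = d->min_offset;
  *hi = d->multi_p ? d->max_offset : d->min_offset;
  assert (*lo <= *hi);
}

/* Return 0 when the two ranges overlap, i.e. there is no sensible order,
   and the difference of the minimum offsets otherwise.  A negative result
   means A should come first.  */
static int
rank_ranges (const struct mem_range *a, const struct mem_range *b)
{
  int lo1, hi1, lo2, hi2;

  effective_range (a, &lo1, &hi1);
  effective_range (b, &lo2, &hi2);

  if (hi1 < lo2 || lo1 > hi2)
    return lo1 - lo2;
  return 0;
}
```

With this model, a pair covering offsets 0..8 ranks before a pair covering 16..24 (negative result), while a single load
at offset 4 ties with the first pair (result 0) because its offset falls inside that pair's range.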

For aarch64 this allows us to handle load/store pair instructions.  For arm we now take into account the equivalent ldrd/strd instructions
as well as other load/store-multiple instructions.
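
To make the PARALLEL handling concrete, here is a minimal standalone sketch of the scan over the elements of a
load/store-multiple.  It is not the GCC implementation: struct mem_op, struct offset_range and collect_offset_range are
hypothetical names, and each SET is assumed to have already been reduced to a base register number, a constant offset
and a load/store flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one SET inside a PARALLEL.  */
struct mem_op
{
  int base_regno;
  int offset;
  bool is_store;
};

/* Hypothetical result: the common base and the range of offsets used.  */
struct offset_range
{
  int base_regno;
  int min_offset;
  int max_offset;
};

/* Scan the N memory ops in OPS.  Succeed only if every op matches
   WANT_STORE and uses the same base register.  On success fill *OUT with
   the base and the minimum and maximum offsets; the order of the offsets
   within OPS does not matter.  On failure *OUT is left untouched.  */
static bool
collect_offset_range (const struct mem_op *ops, size_t n, bool want_store,
		      struct offset_range *out)
{
  assert (out != NULL);
  if (n == 0)
    return false;

  int base = ops[0].base_regno;
  int lo = ops[0].offset;
  int hi = ops[0].offset;

  for (size_t i = 0; i < n; i++)
    {
      if (ops[i].is_store != want_store || ops[i].base_regno != base)
	return false;
      if (ops[i].offset < lo)
	lo = ops[i].offset;
      if (ops[i].offset > hi)
	hi = ops[i].offset;
    }

  out->base_regno = base;
  out->min_offset = lo;
  out->max_offset = hi;
  return true;
}
```

For a load pair reading offsets 16 and 24 from the same base this yields the range 16..24.  A multiple whose elements use
different base registers is rejected, which corresponds to the insn being left as irrelevant to the heuristic.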

Bootstrapped and tested on arm, aarch64, x86_64.
This code is currently enabled only for some arm cores, so other platforms are unaffected.

What do people think?

2015-10-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * sched-int.h (struct autopref_multipass_data_): Remove offset
     field.  Add min_offset, max_offset, multi_mem_insn_p fields.
     * haifa-sched.c (analyze_set_insn_for_autopref): New function.
     (autopref_multipass_init): Use it.  Handle PARALLEL sets.
     (autopref_rank_data): New function.
     (autopref_rank_for_schedule): Use it.
     (autopref_multipass_dfa_lookahead_guard_1): Likewise.

commit a529ccd6a4e07463a40c7dbda10e3a090b0c06d3
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Tue Sep 29 16:58:05 2015 +0100

    model load/store pairs properly in autoprefetcher scheduling

diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index c35d777..223c72c 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -5533,6 +5533,35 @@ insn_finishes_cycle_p (rtx_insn *insn)
   return false;
 }
 
+/* Helper for autopref_multipass_init.  Given a SET insn in PAT and whether
+   we're expecting a memory WRITE or not, check that the insn is relevant to
+   the autoprefetcher modelling code.  Return true iff that is the case.
+   If it is relevant record the base register of the memory op in BASE and
+   the offset in OFFSET.  */
+
+static bool
+analyze_set_insn_for_autopref (rtx pat, int write, rtx *base, int *offset)
+{
+  if (GET_CODE (pat) != SET)
+    return false;
+
+  rtx mem = write ? SET_DEST (pat) : SET_SRC (pat);
+  if (!MEM_P (mem))
+    return false;
+
+  struct address_info info;
+  decompose_mem_address (&info, mem);
+
+  /* TODO: Currently only (base+const) addressing is supported.  */
+  if (info.base == NULL || !REG_P (*info.base)
+      || (info.disp != NULL && !CONST_INT_P (*info.disp)))
+    return false;
+
+  *base = *info.base;
+  *offset = info.disp ? INTVAL (*info.disp) : 0;
+  return true;
+}
+
 /* Functions to model cache auto-prefetcher.
 
    Some of the CPUs have cache auto-prefetcher, which /seems/ to initiate
@@ -5557,30 +5586,139 @@ autopref_multipass_init (const rtx_insn *insn, int write)
 
   gcc_assert (data->status == AUTOPREF_MULTIPASS_DATA_UNINITIALIZED);
   data->base = NULL_RTX;
-  data->offset = 0;
+  data->min_offset = 0;
+  data->max_offset = 0;
+  data->multi_mem_insn_p = false;
   /* Set insn entry initialized, but not relevant for auto-prefetcher.  */
   data->status = AUTOPREF_MULTIPASS_DATA_IRRELEVANT;
 
+  rtx pat = PATTERN (insn);
+
+  /* We have a multi-set insn like a load-multiple or store-multiple.
+     We care about these as long as all the memory ops inside the PARALLEL
+     have the same base register.  We care about the minimum and maximum
+     offsets from that base but don't check for the order of those offsets
+     within the PARALLEL insn itself.  */
+  if (GET_CODE (pat) == PARALLEL)
+    {
+      int n_elems = XVECLEN (pat, 0);
+
+      int i = 0;
+      rtx prev_base = NULL_RTX;
+      int min_offset;
+      int max_offset;
+
+      for (i = 0; i < n_elems; i++)
+	{
+	  rtx set = XVECEXP (pat, 0, i);
+	  if (GET_CODE (set) != SET)
+	    return;
+
+	  rtx base = NULL_RTX;
+	  int offset = 0;
+	  if (!analyze_set_insn_for_autopref (set, write, &base, &offset))
+	    return;
+
+	  if (i == 0)
+	    {
+	      prev_base = base;
+	      min_offset = offset;
+	      max_offset = offset;
+	    }
+	  /* Ensure that all memory operations in the PARALLEL use the same
+	     base register.  */
+	  else if (REGNO (base) != REGNO (prev_base))
+	    return;
+	  else
+	    {
+	      min_offset = MIN (min_offset, offset);
+	      max_offset = MAX (max_offset, offset);
+	    }
+	}
+
+      /* If we reached here then we have a valid PARALLEL of multiple memory
+	 ops with prev_base as the base and min_offset and max_offset
+	 containing the offsets range.  */
+      gcc_assert (prev_base);
+      data->base = prev_base;
+      data->min_offset = min_offset;
+      data->max_offset = max_offset;
+      data->multi_mem_insn_p = true;
+      data->status = AUTOPREF_MULTIPASS_DATA_NORMAL;
+
+      return;
+    }
+
+  /* Otherwise this is a single set memory operation.  */
   rtx set = single_set (insn);
   if (set == NULL_RTX)
     return;
 
-  rtx mem = write ? SET_DEST (set) : SET_SRC (set);
-  if (!MEM_P (mem))
+  if (!analyze_set_insn_for_autopref (set, write, &data->base,
+				       &data->min_offset))
     return;
 
-  struct address_info info;
-  decompose_mem_address (&info, mem);
+  /* This insn is relevant for auto-prefetcher.
+     The base and offset fields will have been filled in the
+     analyze_set_insn_for_autopref call above.  */
+  data->status = AUTOPREF_MULTIPASS_DATA_NORMAL;
+}
 
-  /* TODO: Currently only (base+const) addressing is supported.  */
-  if (info.base == NULL || !REG_P (*info.base)
-      || (info.disp != NULL && !CONST_INT_P (*info.disp)))
-    return;
 
-  /* This insn is relevant for auto-prefetcher.  */
-  data->base = *info.base;
-  data->offset = info.disp ? INTVAL (*info.disp) : 0;
-  data->status = AUTOPREF_MULTIPASS_DATA_NORMAL;
+/* Helper for autopref_rank_for_schedule.  Given the data of two
+   insns relevant to the auto-prefetcher modelling code DATA1 and DATA2
+   return their comparison result.  Return 0 if there is no sensible
+   ranking order for the two insns.  */
+
+static int
+autopref_rank_data (autopref_multipass_data_t data1,
+		     autopref_multipass_data_t data2)
+{
+  /* Simple case when both insns are simple single memory ops.  */
+  if (!data1->multi_mem_insn_p && !data2->multi_mem_insn_p)
+    return data1->min_offset - data2->min_offset;
+
+  /* Two load/store multiple insns.  Return 0 if the offset ranges
+     overlap and the difference between the minimum offsets otherwise.  */
+  else if (data1->multi_mem_insn_p && data2->multi_mem_insn_p)
+    {
+      int min1 = data1->min_offset;
+      int max1 = data1->max_offset;
+      int min2 = data2->min_offset;
+      int max2 = data2->max_offset;
+
+      if (max1 < min2 || min1 > max2)
+	return min1 - min2;
+      else
+	return 0;
+    }
+
+  /* The other two cases are a pair of a load/store multiple and
+     a simple memory op.  Return 0 if the single op's offset is within the
+     range of the multi-op insn and the difference between the single offset
+     and the minimum offset of the multi-set insn otherwise.  */
+  else if (data1->multi_mem_insn_p && !data2->multi_mem_insn_p)
+    {
+      int max1 = data1->max_offset;
+      int min1 = data1->min_offset;
+
+      if (data2->min_offset >= min1
+	  && data2->min_offset <= max1)
+	return 0;
+      else
+	return min1 - data2->min_offset;
+    }
+  else
+    {
+      int max2 = data2->max_offset;
+      int min2 = data2->min_offset;
+
+      if (data1->min_offset >= min2
+	  && data1->min_offset <= max2)
+	return 0;
+      else
+	return data1->min_offset - min2;
+    }
 }
 
 /* Helper function for rank_for_schedule sorting.  */
@@ -5607,7 +5745,7 @@ autopref_rank_for_schedule (const rtx_insn *insn1, const rtx_insn *insn2)
       if (!rtx_equal_p (data1->base, data2->base))
 	continue;
 
-      return data1->offset - data2->offset;
+      return autopref_rank_data (data1, data2);
     }
 
   return 0;
@@ -5632,8 +5770,9 @@ autopref_multipass_dfa_lookahead_guard_1 (const rtx_insn *insn1,
   if (data2->status == AUTOPREF_MULTIPASS_DATA_IRRELEVANT)
     return 0;
 
-  if (rtx_equal_p (data1->base, data2->base)
-      && data1->offset > data2->offset)
+  bool delay_p = rtx_equal_p (data1->base, data2->base)
+		  && autopref_rank_data (data1, data2) > 0;
+  if (delay_p)
     {
       if (sched_verbose >= 2)
 	{
diff --git a/gcc/sched-int.h b/gcc/sched-int.h
index 800262c..12fa93c 100644
--- a/gcc/sched-int.h
+++ b/gcc/sched-int.h
@@ -807,8 +807,17 @@ struct autopref_multipass_data_
 {
   /* Base part of memory address.  */
   rtx base;
-  /* Memory offset.  */
-  int offset;
+
+  /* Memory offsets from the base.  For single simple sets
+     only min_offset is valid.  For multi-set insns min_offset
+     and max_offset record the minimum and maximum offsets from the same
+     base among the sets inside the PARALLEL.  */
+  int min_offset;
+  int max_offset;
+
+  /* True if this is a load/store-multiple instruction.  */
+  bool multi_mem_insn_p;
+
   /* Entry status.  */
   enum autopref_multipass_data_status status;
 };