Unhandled level 2 translation fault on A72 board.


Message ID

56A7597D.6020609@huawei.com

State

New

Headers

Received-SPF: pass (google.com: domain of
	linux-arm-kernel-bounces+patch=linaro.org@lists.infradead.org
	designates 2001:1868:205::9 as permitted sender)
	client-ip=2001:1868:205::9; 
Subject: Re: Unhandled level 2 translation fault on A72 board.
To: Catalin Marinas <catalin.marinas@arm.com>
References: <56A72246.4050105@huawei.com>
	<20160126110358.GA23579@localhost.localdomain>
From: Ding Tianhong <dingtianhong@huawei.com>
Message-ID: <56A7597D.6020609@huawei.com>
Date: Tue, 26 Jan 2016 19:33:17 +0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101
	Thunderbird/38.5.1
MIME-Version: 1.0
In-Reply-To: <20160126110358.GA23579@localhost.localdomain>
Precedence: list
Cc: Arnd Bergmann <arnd@arndb.de>, Will Deacon <Will.Deacon@arm.com>,
	Linuxarm <linuxarm@huawei.com>, "linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>, 
	"Guohanjun \(Hanjun Guo\)" <guohanjun@huawei.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: linux-arm-kernel-bounces+patch=linaro.org@lists.infradead.org

Commit Message

Ding Tianhong Jan. 26, 2016, 11:33 a.m. UTC

On 2016/1/26 19:03, Catalin Marinas wrote:
> On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
>> I hit this problem when running the hackbench test on an A72 board:
>>
>> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006
>> pgd = ffffffc01a1f0000
>> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
>>
>> CPU: 1 PID: 4779 Comm: sh Tainted: G O 4.1.15+ #21
>> Hardware name: Hisilicon PhosphorHi1382 EVB (DT)
>> task: ffffffc0163cc500 ti: ffffffc083abc000 task.ti: ffffffc083abc000
>> PC is at 0x7f96be0c80
>> LR is at 0x7fb2684eb4
>> pc : [<0000007f96be0c80>] lr : [<0000007fb2684eb4>] pstate: 60000000
>
> So here it's user space trying to execute from 0x7f96be0c80 (instruction
> abort).
>
>> sh[4963]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000006
>> pgd = ffffffc0180c6000
>> [00000000] *pgd=0000000015157003, *pud=0000000015157003, *pmd=0000000000000000
>>
>> CPU: 0 PID: 4963 Comm: sh Tainted: G O 4.1.15+ #21
>> Hardware name: Hisilicon PhosphorHi1382 EVB (DT)
>> task: ffffffc0163cb980 ti: ffffffc0840c8000 task.ti: ffffffc0840c8000
>> PC is at 0x42c0c8
>> LR is at 0x42c03c
>> pc : [<000000000042c0c8>] lr : [<000000000042c03c>] pstate: 80000000
>
> And here you have a null pointer dereference.
>
>> If I run the benchmark only on cores in the same cluster, it looks
>> fine and no error occurs, but if I enable a core in a different
>> cluster, the fault happens.
>>
>> I remember hitting the same problem on the A57 and fixing it by setting
>> bit 6 of CPUECTLR_EL1 and enabling MN. This time the same settings seem
>> to have no effect. I have no idea what causes this problem; do the A57
>> and A72 differ that much in their TLBs?
>
> I can't tell for sure it's a TLB issue. The kernel page table dump shows
> *pmd being 0, so the fault is correctly called "level 2 translation
> fault". It also seems that there is no vma at this address, hence the
> kernel reports it as unhandled. It looks like data corruption which
> could be caused by cache or TLB incoherence. Just make sure the
> interconnect linking the two clusters is configured correctly by
> _firmware_ before Linux starts.
>

Hi Catalin,

Thanks for the reply. I tried applying this patch as a test:

---
 arch/arm64/kernel/process.c | 9 +++++++++
 1 file changed, 9 insertions(+)

 	hw_breakpoint_thread_switch(next);
 	contextidr_thread_switch(next);
 
+	tlb_flush_thread(prev);
+
 	/*
 	 * Complete any pending TLB or cache maintenance on this CPU in case
 	 * the thread migrates to a different CPU.

With this patch applied, hackbench runs fine, so I guess the old thread's TLB
entries are not being invalidated promptly. I don't know why, though, since
everything is fine on the A57. Am I missing something?

Ding
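
The two esr values in the fault reports above can be decoded by hand. The following is a minimal sketch added for illustration, not code from the thread: the helper names (esr_ec, esr_translation_fault_level, va_table_index) are hypothetical, the field positions follow the ARMv8-A ESR_ELx layout, and the table-index helper assumes a 4 KB granule with three translation levels (39-bit VA). The ffffffc0... kernel addresses and the identical *pgd and *pud values suggest that configuration, but the thread does not state it.

```c
#include <assert.h>
#include <stdint.h>

/* ESR_ELx fields (ARMv8-A): EC = bits [31:26], fault status code = bits [5:0]. */
#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3fu
#define ESR_FSC_MASK     0x3fu
#define ESR_EC_IABT_LOW  0x20u   /* instruction abort taken from a lower EL */
#define ESR_EC_DABT_LOW  0x24u   /* data abort taken from a lower EL */

/* Exception class: tells what kind of access faulted. */
static unsigned int esr_ec(uint32_t esr)
{
	return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/*
 * Fault status codes 0x04..0x07 are translation faults at levels 0..3.
 * Returns the level, or -1 if the code is not a translation fault.
 */
static int esr_translation_fault_level(uint32_t esr)
{
	unsigned int fsc = esr & ESR_FSC_MASK;

	return (fsc >= 0x04 && fsc <= 0x07) ? (int)(fsc - 0x04) : -1;
}

/*
 * Table index used at a given lookup level, ASSUMING a 4 KB granule with
 * three levels (39-bit VA): level 1 = bits [38:30] (pgd/pud),
 * level 2 = bits [29:21] (pmd), level 3 = bits [20:12] (pte).
 * Each table holds 512 entries, hence the 9-bit mask.
 */
static unsigned int va_table_index(uint64_t va, int level)
{
	assert(level >= 1 && level <= 3);
	return (unsigned int)((va >> (12 + 9 * (3 - level))) & 0x1ff);
}
```

Applied to the reports above, 0x83000006 decodes as an instruction abort from a lower exception level and 0x92000006 as a data abort, both with a level 2 translation fault status, which matches the *pmd=0000000000000000 entries in the dumps.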




_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

Comments

Catalin Marinas Jan. 26, 2016, 11:44 a.m. UTC | #1

On Tue, Jan 26, 2016 at 07:33:17PM +0800, Ding Tianhong wrote:
> On 2016/1/26 19:03, Catalin Marinas wrote:
> > On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
> >> I hit this problem when running the hackbench test on an A72 board:
> >>
> >> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006
> >> pgd = ffffffc01a1f0000
> >> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
[...]
> > I can't tell for sure it's a TLB issue. The kernel page table dump shows
> > *pmd being 0, so the fault is correctly called "level 2 translation
> > fault". It also seems that there is no vma at this address, hence the
> > kernel reports it as unhandled. It looks like data corruption which
> > could be caused by cache or TLB incoherence. Just make sure the
> > interconnect linking the two clusters is configured correctly by
> > _firmware_ before Linux starts.
>
> Thanks for the reply. I tried applying this patch as a test:
>
> ---
>  arch/arm64/kernel/process.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 6391485..d7d8439 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
>  	: : "r" (tpidr), "r" (tpidrro));
>  }
>  
> +static void tlb_flush_thread(struct task_struct *prev)
> +{
> +	/* Flush the prev task's TLB entries */
> +	if (prev->mm)
> +		flush_tlb_mm(prev->mm);
> +}
> +
>  /*
>   * Thread switching.
>   */
> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
>  	hw_breakpoint_thread_switch(next);
>  	contextidr_thread_switch(next);
>  
> +	tlb_flush_thread(prev);
> +
>  	/*
>  	 * Complete any pending TLB or cache maintenance on this CPU in case
>  	 * the thread migrates to a different CPU.
>
> With this patch applied, hackbench runs fine, so I guess the old thread's TLB
> entries are not being invalidated promptly. I don't know why, though, since
> everything is fine on the A57. Am I missing something?


It looks like the TLB invalidation messages may not get across the CCI
between clusters. I don't have the TRMs at hand but make sure all the
relevant bits in the CPUs and CCI are enabled.

BTW, which kernel version are you running? Is the firmware your own or
built around ARM Trusted Firmware?

-- 
Catalin
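
The hypothesis in the reply above can be pictured with a toy model. The sketch below is an illustration only: it is not kernel code and it does not describe how the real CCI or the Cortex-A72 behave. Every name in it (toy_cluster, dvm_enabled, toy_tlbi_asid_broadcast) is hypothetical. It only shows why a broadcast invalidate that never reaches the other cluster leaves a stale entry there, which would be consistent with the faults appearing only once cores from both clusters are running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CLUSTERS  2
#define TLB_ENTRIES  4

/* One cached translation, tagged with the address space it belongs to. */
struct toy_tlb_entry {
	bool     valid;
	uint16_t asid;
	uint64_t vpn;
};

struct toy_cluster {
	bool dvm_enabled;   /* hypothetical: does this cluster receive broadcasts? */
	struct toy_tlb_entry tlb[TLB_ENTRIES];
};

/* Install a translation in one cluster's TLB; returns false if it is full. */
static bool toy_tlb_fill(struct toy_cluster *c, uint16_t asid, uint64_t vpn)
{
	for (size_t i = 0; i < TLB_ENTRIES; i++) {
		if (!c->tlb[i].valid) {
			c->tlb[i] = (struct toy_tlb_entry){ true, asid, vpn };
			return true;
		}
	}
	return false;
}

/* Does this cluster still hold any entry for the given ASID? */
static bool toy_tlb_has_asid(const struct toy_cluster *c, uint16_t asid)
{
	for (size_t i = 0; i < TLB_ENTRIES; i++)
		if (c->tlb[i].valid && c->tlb[i].asid == asid)
			return true;
	return false;
}

/*
 * Model of a broadcast invalidate-by-ASID issued from cluster `src`.
 * The issuing cluster always invalidates locally; the others only do so
 * if the modelled interconnect delivers the message to them.
 */
static void toy_tlbi_asid_broadcast(struct toy_cluster *sys, size_t src,
				    uint16_t asid)
{
	for (size_t c = 0; c < NR_CLUSTERS; c++) {
		if (c != src && !sys[c].dvm_enabled)
			continue;   /* message lost: stale entries survive */
		for (size_t i = 0; i < TLB_ENTRIES; i++)
			if (sys[c].tlb[i].valid && sys[c].tlb[i].asid == asid)
				sys[c].tlb[i].valid = false;
	}
}
```

In this model, a cluster whose dvm_enabled flag is false keeps its entry for an ASID after another cluster has invalidated it, so a later access there would use a translation that the page tables no longer back.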


Ding Tianhong Jan. 26, 2016, 1:18 p.m. UTC | #2

On 2016/1/26 19:44, Catalin Marinas wrote:
> On Tue, Jan 26, 2016 at 07:33:17PM +0800, Ding Tianhong wrote:
>> On 2016/1/26 19:03, Catalin Marinas wrote:
>>> On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
>>>> I hit this problem when running the hackbench test on an A72 board:
>>>>
>>>> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006
>>>> pgd = ffffffc01a1f0000
>>>> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
> [...]
>>> I can't tell for sure it's a TLB issue. The kernel page table dump shows
>>> *pmd being 0, so the fault is correctly called "level 2 translation
>>> fault". It also seems that there is no vma at this address, hence the
>>> kernel reports it as unhandled. It looks like data corruption which
>>> could be caused by cache or TLB incoherence. Just make sure the
>>> interconnect linking the two clusters is configured correctly by
>>> _firmware_ before Linux starts.
>>
>> Thanks for the reply. I tried applying this patch as a test:
>>
>> ---
>>  arch/arm64/kernel/process.c | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
>> index 6391485..d7d8439 100644
>> --- a/arch/arm64/kernel/process.c
>> +++ b/arch/arm64/kernel/process.c
>> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
>>  	: : "r" (tpidr), "r" (tpidrro));
>>  }
>>  
>> +static void tlb_flush_thread(struct task_struct *prev)
>> +{
>> +	/* Flush the prev task's TLB entries */
>> +	if (prev->mm)
>> +		flush_tlb_mm(prev->mm);
>> +}
>> +
>>  /*
>>   * Thread switching.
>>   */
>> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
>>  	hw_breakpoint_thread_switch(next);
>>  	contextidr_thread_switch(next);
>>  
>> +	tlb_flush_thread(prev);
>> +
>>  	/*
>>  	 * Complete any pending TLB or cache maintenance on this CPU in case
>>  	 * the thread migrates to a different CPU.
>>
>> With this patch applied, hackbench runs fine, so I guess the old thread's TLB
>> entries are not being invalidated promptly. I don't know why, though, since
>> everything is fine on the A57. Am I missing something?
>
> It looks like the TLB invalidation messages may not get across the CCI
> between clusters. I don't have the TRMs at hand but make sure all the
> relevant bits in the CPUs and CCI are enabled.
>

I have indeed checked them several times. I need more information and will
check them again.

> BTW, which kernel version are you running? Is the firmware your own or
> built around ARM Trusted Firmware?

I am using kernel version 4.1, and the firmware is our own.

Ding
 




Patch

diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 6391485..d7d8439 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
 	: : "r" (tpidr), "r" (tpidrro));
 }
 
+static void tlb_flush_thread(struct task_struct *prev)
+{
+	/* Flush the prev task's TLB entries */
+	if (prev->mm)
+		flush_tlb_mm(prev->mm);
+}
+
 /*
  * Thread switching.
  */
@@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
 	hw_breakpoint_thread_switch(next);
 	contextidr_thread_switch(next);
 
+	tlb_flush_thread(prev);
+
 	/*
 	 * Complete any pending TLB or cache maintenance on this CPU in case
 	 * the thread migrates to a different CPU.
