Message ID | 5242A79D.1030709@linaro.org |
---|---|
State | Superseded |
Headers | show |
On 09/25/2013 05:06 AM, Will Newton wrote: > > Add support for pointer mangling in glibc internal structures in C > and assembler code. > > Tested on armv7 with hard and soft thread pointers. Have you measured the performance versus using the existing global variable? TLS access on ARM is quite slow and it looks to me like it may be faster to use the global variable. Keep in mind that the pointer guard and stack guard do not vary by thread. 32-bit ARM is currently using a global variable e.g. __pointer_chk_guard, all you need to do to make it work is adjust the definitions of PTR_MANGLE and PTR_DEMANGLE to reference the global symbol. This is the second proposal for ARM (first was [1] for AArch64) to support storing the a guard in the TCB, but nobody has responded yet to my question about performance. Cheers, Carlos. [1] https://sourceware.org/ml/libc-ports/2013-08/msg00052.html
> On Sep 25, 2013, at 9:09 AM, "Carlos O'Donell" <carlos@redhat.com> wrote: > >> On 09/25/2013 05:06 AM, Will Newton wrote: >> >> Add support for pointer mangling in glibc internal structures in C >> and assembler code. >> >> Tested on armv7 with hard and soft thread pointers. > > Have you measured the performance versus using the existing > global variable? > > TLS access on ARM is quite slow and it looks to me like it > may be faster to use the global variable. Keep in mind that > the pointer guard and stack guard do not vary by thread. > > 32-bit ARM is currently using a global variable e.g. > __pointer_chk_guard, all you need to do to make it work > is adjust the definitions of PTR_MANGLE and PTR_DEMANGLE > to reference the global symbol. > > This is the second proposal for ARM (first was [1] for > AArch64) to support storing the a guard in the TCB, but > nobody has responded yet to my question about performance. T I wonder the same question. Why move to using to tcb in the first place. The only answer I can figure out is that is what x86 does but that is not always the correct answer. > > Cheers, > Carlos. > > [1] https://sourceware.org/ml/libc-ports/2013-08/msg00052.html >
On 25 September 2013 17:09, Carlos O'Donell <carlos@redhat.com> wrote: > On 09/25/2013 05:06 AM, Will Newton wrote: >> >> Add support for pointer mangling in glibc internal structures in C >> and assembler code. >> >> Tested on armv7 with hard and soft thread pointers. > > Have you measured the performance versus using the existing > global variable? No, but I'll put together a patch for that approach and see how it looks. > TLS access on ARM is quite slow and it looks to me like it > may be faster to use the global variable. Keep in mind that > the pointer guard and stack guard do not vary by thread. From a back of the envelope calculation the cost of accessing TLS is one cycle faster than accessing a global in best case (e.g. Cortex-A15), considerably slower in common case (50-60 cycles, Cortex-A9) and slower still in worst case (function call to libgcc and the kernel, ARMv4/ARMv5). Pointer guard looks to be on slower code paths anyway as compared to stack guard, but you may be right that the global variable solution is the best way to go. > 32-bit ARM is currently using a global variable e.g. > __pointer_chk_guard, all you need to do to make it work > is adjust the definitions of PTR_MANGLE and PTR_DEMANGLE > to reference the global symbol. > > This is the second proposal for ARM (first was [1] for > AArch64) to support storing the a guard in the TCB, but > nobody has responded yet to my question about performance. AArch64 the equation is different - all AArch64 cores have a TLS register, and while it is not general purpose I suspect accessing it will be much faster than on the worst performing 32bit cores. I don't have any numbers though.
On 09/25/2013 12:23 PM, Will Newton wrote: > On 25 September 2013 17:09, Carlos O'Donell <carlos@redhat.com> wrote: >> On 09/25/2013 05:06 AM, Will Newton wrote: >>> >>> Add support for pointer mangling in glibc internal structures in C >>> and assembler code. >>> >>> Tested on armv7 with hard and soft thread pointers. >> >> Have you measured the performance versus using the existing >> global variable? > > No, but I'll put together a patch for that approach and see how it looks. > >> TLS access on ARM is quite slow and it looks to me like it >> may be faster to use the global variable. Keep in mind that >> the pointer guard and stack guard do not vary by thread. > > From a back of the envelope calculation the cost of accessing TLS is > one cycle faster than accessing a global in best case (e.g. > Cortex-A15), considerably slower in common case (50-60 cycles, > Cortex-A9) and slower still in worst case (function call to libgcc and > the kernel, ARMv4/ARMv5). > > Pointer guard looks to be on slower code paths anyway as compared to > stack guard, but you may be right that the global variable solution is > the best way to go. Thanks for exploring this solution. >> 32-bit ARM is currently using a global variable e.g. >> __pointer_chk_guard, all you need to do to make it work >> is adjust the definitions of PTR_MANGLE and PTR_DEMANGLE >> to reference the global symbol. >> >> This is the second proposal for ARM (first was [1] for >> AArch64) to support storing the a guard in the TCB, but >> nobody has responded yet to my question about performance. > > AArch64 the equation is different - all AArch64 cores have a TLS > register, and while it is not general purpose I suspect accessing it > will be much faster than on the worst performing 32bit cores. I don't > have any numbers though. I don't disagree with you, but I'd like to see some due-diligence in testing out the two alternatives and reporting back the performance numbers. You need not implement both, just test two access methods using a small test program and report the data. Cheers, Carlos.
On 09/25/2013 05:23 PM, Will Newton wrote: > From a back of the envelope calculation the cost of accessing TLS is > one cycle faster than accessing a global in best case (e.g. > Cortex-A15) Why is TLS so fast on Cortex-A15? Andrew.
On 25 September 2013 17:59, Andrew Haley <aph@redhat.com> wrote: > On 09/25/2013 05:23 PM, Will Newton wrote: >> From a back of the envelope calculation the cost of accessing TLS is >> one cycle faster than accessing a global in best case (e.g. >> Cortex-A15) > > Why is TLS so fast on Cortex-A15? I don't have exact numbers but the mrc instruction used to load the thread pointer into a general purpose register has very variable timings depending on which core you have. A15 is the fastest of the Cortex-A cores in this regard.
On 09/25/2013 05:06 AM, Will Newton wrote: > > Add support for pointer mangling in glibc internal structures in C > and assembler code. > > Tested on armv7 with hard and soft thread pointers. > > ports/ChangeLog.arm: > > 2013-09-24 Will Newton <will.newton@linaro.org> > > * sysdeps/arm/__longjmp.S (__longjmp): Demangle fp, sp > and lr when restoring register values. > * sysdeps/arm/include/bits/setjmp.h (JMP_BUF_REGLIST): Remove > sp and lr from list and replace fp with a4. > * sysdeps/arm/jmpbuf-unwind.h (_jmpbuf_sp): New function. > (_JMPBUF_UNWINDS_ADJ): Call _jmpbuf_sp. > * sysdeps/arm/nptl/tcb-offsets.sym: Add POINTER_GUARD. > * sysdeps/arm/nptl/tls.h (tcbhead_t): Remove private and add > pointer_guard. (THREAD_GET_POINTER_GUARD): New macro. > (THREAD_SET_POINTER_GUARD): New macro. > (THREAD_COPY_POINTER_GUARD): New macro. > * sysdeps/arm/setjmp.S (__sigsetjmp): Mangle fp, sp and lr > before storing register values. > * sysdeps/unix/sysv/linux/arm/sysdep.h (PTR_MANGLE): New macro. > (PTR_DEMANGLE): Likewise. (PTR_MANGLE2): Likewise. > (PTR_DEMANGLE2): Likewise. As of the fix for CVE-2013-4788 (bug 15754) there is now a regression test that ensures the pointer guard varies with each process and is indeed somewhat variable. You will need to provide your own stackguard-macros.h file with the appropriate macros including POINTER_CHK_GUARD to allow tst-ptrguard1 and tst-ptrguard1-static to pass. If these tests don't pass then you've got something wrong. Cheers, Carlos.
diff --git a/ports/sysdeps/arm/__longjmp.S b/ports/sysdeps/arm/__longjmp.S index a5edede..2b1f7f4 100644 --- a/ports/sysdeps/arm/__longjmp.S +++ b/ports/sysdeps/arm/__longjmp.S @@ -34,10 +34,24 @@ ENTRY (__longjmp) sfi_breg ip, \ ldr r4, [\B, #32] /* jmpbuf's sp */ cfi_undefined (r4) +#ifdef PTR_DEMANGLE + PTR_DEMANGLE (r4, r4, a3, a4) +#endif CHECK_SP (r4) #endif sfi_sp sfi_breg ip, \ ldmia \B!, JMP_BUF_REGLIST +#ifdef PTR_DEMANGLE + PTR_DEMANGLE (fp, a4, a3, a2) + ldr a4, [ip], #4 + PTR_DEMANGLE2 (sp, a4, a3) + ldr a4, [ip], #4 + PTR_DEMANGLE2 (lr, a4, a3) +#else + mov fp, a4 + ldr sp, [ip], #4 + ldr lr, [ip], #4 +#endif cfi_restore (v1) cfi_restore (v2) cfi_restore (v3) diff --git a/ports/sysdeps/arm/include/bits/setjmp.h b/ports/sysdeps/arm/include/bits/setjmp.h index 1559d7b..64505dc 100644 --- a/ports/sysdeps/arm/include/bits/setjmp.h +++ b/ports/sysdeps/arm/include/bits/setjmp.h @@ -26,8 +26,9 @@ #ifndef _ISOMAC /* Register list for a ldm/stm instruction to load/store - the general registers from a __jmp_buf. */ -# define JMP_BUF_REGLIST {v1-v6, sl, fp, sp, lr} + the general registers from a __jmp_buf. The a4 register + contains fp at this point. */ +# define JMP_BUF_REGLIST {a4, v1-v6, sl} /* Index of __jmp_buf where the sp register resides. */ # define __JMP_BUF_SP 8 diff --git a/ports/sysdeps/arm/jmpbuf-unwind.h b/ports/sysdeps/arm/jmpbuf-unwind.h index 0863540..1b0d020 100644 --- a/ports/sysdeps/arm/jmpbuf-unwind.h +++ b/ports/sysdeps/arm/jmpbuf-unwind.h @@ -17,6 +17,7 @@ #include <setjmp.h> #include <stdint.h> +#include <sysdep.h> #include <unwind.h> /* Test if longjmp to JMPBUF would unwind the frame @@ -27,8 +28,18 @@ #define _JMPBUF_CFA_UNWINDS_ADJ(_jmpbuf, _context, _adj) \ _JMPBUF_UNWINDS_ADJ (_jmpbuf, (void *) _Unwind_GetCFA (_context), _adj) +static inline uintptr_t __attribute__ ((unused)) +_jmpbuf_sp (__jmp_buf regs) +{ + uintptr_t sp = regs[__JMP_BUF_SP]; +#ifdef PTR_DEMANGLE + PTR_DEMANGLE (sp); +#endif + return sp; +} + #define _JMPBUF_UNWINDS_ADJ(_jmpbuf, _address, _adj) \ - ((uintptr_t) (_address) - (_adj) < (uintptr_t) (_jmpbuf)[__JMP_BUF_SP] - (_adj)) + ((uintptr_t) (_address) - (_adj) < _jmpbuf_sp (_jmpbuf) - (_adj)) /* We use the normal longjmp for unwinding. */ #define __libc_unwind_longjmp(buf, val) __libc_longjmp (buf, val) diff --git a/ports/sysdeps/arm/nptl/tcb-offsets.sym b/ports/sysdeps/arm/nptl/tcb-offsets.sym index 92cc441..06d792f 100644 --- a/ports/sysdeps/arm/nptl/tcb-offsets.sym +++ b/ports/sysdeps/arm/nptl/tcb-offsets.sym @@ -9,3 +9,5 @@ MULTIPLE_THREADS_OFFSET thread_offsetof (header.multiple_threads) PID_OFFSET thread_offsetof (pid) TID_OFFSET thread_offsetof (tid) + +POINTER_GUARD offsetof (tcbhead_t, pointer_guard) diff --git a/ports/sysdeps/arm/nptl/tls.h b/ports/sysdeps/arm/nptl/tls.h index da15027..e855507 100644 --- a/ports/sysdeps/arm/nptl/tls.h +++ b/ports/sysdeps/arm/nptl/tls.h @@ -56,7 +56,7 @@ typedef union dtv typedef struct { dtv_t *dtv; - void *private; + uintptr_t pointer_guard; } tcbhead_t; /* This is the size of the initial TCB. */ @@ -119,6 +119,14 @@ typedef struct #define THREAD_SETMEM_NC(descr, member, idx, value) \ descr->member[idx] = (value) +/* Get/set the pointer guard field in TCB head. */ +#define THREAD_GET_POINTER_GUARD() \ + (((tcbhead_t *) __builtin_thread_pointer ())->pointer_guard) +#define THREAD_SET_POINTER_GUARD(value) \ + (((tcbhead_t *) __builtin_thread_pointer ())->pointer_guard = (value)) +# define THREAD_COPY_POINTER_GUARD(descr) \ + (((tcbhead_t *) (descr + 1))->pointer_guard = THREAD_GET_POINTER_GUARD ()) + /* Get and set the global scope generation counter in struct pthread. */ #define THREAD_GSCOPE_FLAG_UNUSED 0 #define THREAD_GSCOPE_FLAG_USED 1 diff --git a/ports/sysdeps/arm/setjmp.S b/ports/sysdeps/arm/setjmp.S index a6c161d..b38b919 100644 --- a/ports/sysdeps/arm/setjmp.S +++ b/ports/sysdeps/arm/setjmp.S @@ -24,11 +24,25 @@ #include <arm-features.h> ENTRY (__sigsetjmp) +#ifdef PTR_MANGLE + PTR_MANGLE (a4, fp, a3, ip) +#else + mov a4, fp +#endif mov ip, r0 /* Save registers */ sfi_breg ip, \ stmia \B!, JMP_BUF_REGLIST +#ifdef PTR_MANGLE + PTR_MANGLE2 (a4, sp, a3) + str a4, [ip], #4 + PTR_MANGLE2 (a4, lr, a3) + str a4, [ip], #4 +#else + str sp, [ip], #4 + str lr, [ip], #4 +#endif #if !defined ARM_ASSUME_NO_IWMMXT || defined __SOFTFP__ # define NEED_HWCAP 1 diff --git a/ports/sysdeps/unix/sysv/linux/arm/sysdep.h b/ports/sysdeps/unix/sysv/linux/arm/sysdep.h index b195d8e..5991641 100644 --- a/ports/sysdeps/unix/sysv/linux/arm/sysdep.h +++ b/ports/sysdeps/unix/sysv/linux/arm/sysdep.h @@ -435,8 +435,24 @@ __local_syscall_error: \ #endif /* __ASSEMBLER__ */ -/* Pointer mangling is not yet supported for ARM. */ -#define PTR_MANGLE(var) (void) (var) -#define PTR_DEMANGLE(var) (void) (var) +/* Pointer mangling support. */ +#if defined NOT_IN_libc && defined IS_IN_rtld +/* We cannot use the thread descriptor because in ld.so we use setjmp + earlier than the descriptor is initialized. */ +#else +# ifdef __ASSEMBLER__ +# define PTR_MANGLE(dst, src, guard, tmp) \ + mov tmp, r0; GET_TLS (guard); ldr guard, [r0, POINTER_GUARD]; \ + eor dst, src, guard; mov r0, tmp +# define PTR_MANGLE2(dst, src, guard) eor dst, src, guard +# define PTR_DEMANGLE(dst, src, guard, tmp) \ + PTR_MANGLE (dst, src, guard, tmp) +# define PTR_DEMANGLE2(dst, src, guard) PTR_MANGLE2 (dst, src, guard) +# else +# define PTR_MANGLE(var) \ + (var) = (__typeof (var)) ((uintptr_t) (var) ^ THREAD_GET_POINTER_GUARD ()) +# define PTR_DEMANGLE(var) PTR_MANGLE (var) +# endif +#endif #endif /* linux/arm/sysdep.h */