Message ID: 80a33cc325055bc9d63e4ef272c5b7f68f8fa812.1406301772.git.ian.campbell@citrix.com
State: New
Hi Ian,

On 07/25/2014 04:22 PM, Ian Campbell wrote:
> The only really interesting changes here are the updates to mem* which update
> to actually optimised versions and introduce an optimised memcmp.

I didn't read the whole code as I assume it's just a copy with few
changes from Linux.

Acked-by: Julien Grall <julien.grall@linaro.org>

Regards,

> bitops: No change to the bits we import. Record new baseline.
>
> cmpxchg: Import:
> 60010e5 arm64: cmpxchg: update macros to prevent warnings
> Author: Mark Hambleton <mahamble@broadcom.com>
> Signed-off-by: Mark Hambleton <mahamble@broadcom.com>
> Signed-off-by: Mark Brown <broonie@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>
> e1dfda9 arm64: xchg: prevent warning if return value is unused
> Author: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>
> e1dfda9 resolves the warning which previously caused us to skip 60010e508111.
>
> Since arm32 and arm64 now differ (as do Linux arm and arm64) here the
> existing definition in asm/system.h gets moved to asm/arm32/cmpxchg.h.
> Previously this was shadowing the arm64 one but they happened to be identical.
>
> atomics: Import:
> 8715466 arch,arm64: Convert smp_mb__*()
> Author: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
>
> This just drops some unused (by us) smp_mb__*_atomic_*.
>
> spinlocks: No change. Record new baseline.
>
> mem*: Import:
> 808dbac arm64: lib: Implement optimized memcpy routine
> Author: zhichang.yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> 280adc1 arm64: lib: Implement optimized memmove routine
> Author: zhichang.yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> b29a51f arm64: lib: Implement optimized memset routine
> Author: zhichang.yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> d875c9b arm64: lib: Implement optimized memcmp routine
> Author: zhichang.yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Zhichang Yuan <zhichang.yuan@linaro.org>
> Signed-off-by: Deepak Saxena <dsaxena@linaro.org>
> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
>
> These import various routines from Linaro's Cortex Strings library.
>
> Added assembler.h similar to on arm32 to define the various magic symbols
> which these imported routines depend on (e.g. CPU_LE() and CPU_BE())
>
> str*: No changes. Record new baseline.
>
> Correct the paths in the README.
>
> *_page: No changes. Record new baseline.
>
> README previously said clear_page was unused while copy_page was, which was
> backwards.
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com> > --- > xen/arch/arm/README.LinuxPrimitives | 36 +++-- > xen/arch/arm/arm64/lib/Makefile | 2 +- > xen/arch/arm/arm64/lib/assembler.h | 13 ++ > xen/arch/arm/arm64/lib/memchr.S | 1 + > xen/arch/arm/arm64/lib/memcmp.S | 258 +++++++++++++++++++++++++++++++++++ > xen/arch/arm/arm64/lib/memcpy.S | 193 +++++++++++++++++++++++--- > xen/arch/arm/arm64/lib/memmove.S | 191 ++++++++++++++++++++++---- > xen/arch/arm/arm64/lib/memset.S | 208 +++++++++++++++++++++++++--- > xen/include/asm-arm/arm32/cmpxchg.h | 3 + > xen/include/asm-arm/arm64/atomic.h | 5 - > xen/include/asm-arm/arm64/cmpxchg.h | 35 +++-- > xen/include/asm-arm/string.h | 5 + > xen/include/asm-arm/system.h | 3 - > 13 files changed, 844 insertions(+), 109 deletions(-) > create mode 100644 xen/arch/arm/arm64/lib/assembler.h > create mode 100644 xen/arch/arm/arm64/lib/memcmp.S > > diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives > index 6cd03ca..69eeb70 100644 > --- a/xen/arch/arm/README.LinuxPrimitives > +++ b/xen/arch/arm/README.LinuxPrimitives > @@ -6,29 +6,26 @@ were last updated. > arm64: > ===================================================================== > > -bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b) > +bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027) > > linux/arch/arm64/lib/bitops.S xen/arch/arm/arm64/lib/bitops.S > linux/arch/arm64/include/asm/bitops.h xen/include/asm-arm/arm64/bitops.h > > --------------------------------------------------------------------- > > -cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189) > +cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b) > > linux/arch/arm64/include/asm/cmpxchg.h xen/include/asm-arm/arm64/cmpxchg.h > > -Skipped: > - 60010e5 arm64: cmpxchg: update macros to prevent warnings > - > --------------------------------------------------------------------- > > -atomics: last sync @ v3.14-rc7 (last commit: 95c4189) > +atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027) > > linux/arch/arm64/include/asm/atomic.h xen/include/asm-arm/arm64/atomic.h > > --------------------------------------------------------------------- > > -spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189) > +spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9) > > linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h > > @@ -38,30 +35,31 @@ Skipped: > > --------------------------------------------------------------------- > > -mem*: last sync @ v3.14-rc7 (last commit: 4a89922) > +mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240) > > -linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S > -linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S > -linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S > -linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S > +linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S > +linux/arch/arm64/lib/memcmp.S xen/arch/arm/arm64/lib/memcmp.S > +linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S > +linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S > +linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S > > -for i in memchr.S memcpy.S memmove.S memset.S ; do > +for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do > diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i > done > > --------------------------------------------------------------------- > > -str*: last sync @ v3.14-rc7 (last commit: 2b8cac8) > +str*: last sync @ v3.16-rc6 (last 
commit: 2b8cac814cd5) > > -linux/arch/arm/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S > -linux/arch/arm/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S > +linux/arch/arm64/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S > +linux/arch/arm64/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S > > --------------------------------------------------------------------- > > -{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13) > +{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387) > > -linux/arch/arm64/lib/clear_page.S unused in Xen > -linux/arch/arm64/lib/copy_page.S xen/arch/arm/arm64/lib/copy_page.S > +linux/arch/arm64/lib/clear_page.S xen/arch/arm/arm64/lib/clear_page.S > +linux/arch/arm64/lib/copy_page.S unused in Xen > > ===================================================================== > arm32 > diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile > index b895afa..2e7fb64 100644 > --- a/xen/arch/arm/arm64/lib/Makefile > +++ b/xen/arch/arm/arm64/lib/Makefile > @@ -1,4 +1,4 @@ > -obj-y += memcpy.o memmove.o memset.o memchr.o > +obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o > obj-y += clear_page.o > obj-y += bitops.o find_next_bit.o > obj-y += strchr.o strrchr.o > diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h > new file mode 100644 > index 0000000..84669d1 > --- /dev/null > +++ b/xen/arch/arm/arm64/lib/assembler.h > @@ -0,0 +1,13 @@ > +#ifndef __ASM_ASSEMBLER_H__ > +#define __ASM_ASSEMBLER_H__ > + > +#ifndef __ASSEMBLY__ > +#error "Only include this from assembly code" > +#endif > + > +/* Only LE support so far */ > +#define CPU_BE(x...) > +#define CPU_LE(x...) x > + > +#endif /* __ASM_ASSEMBLER_H__ */ > + > diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S > index 3cc1b01..b04590c 100644 > --- a/xen/arch/arm/arm64/lib/memchr.S > +++ b/xen/arch/arm/arm64/lib/memchr.S > @@ -18,6 +18,7 @@ > */ > > #include <xen/config.h> > +#include "assembler.h" > > /* > * Find a character in an area of memory. > diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S > new file mode 100644 > index 0000000..9aad925 > --- /dev/null > +++ b/xen/arch/arm/arm64/lib/memcmp.S > @@ -0,0 +1,258 @@ > +/* > + * Copyright (C) 2013 ARM Ltd. > + * Copyright (C) 2013 Linaro. > + * > + * This code is based on glibc cortex strings work originally authored by Linaro > + * and re-licensed under GPLv2 for the Linux kernel. The original code can > + * be found @ > + * > + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ > + * files/head:/src/aarch64/ > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program. If not, see <http://www.gnu.org/licenses/>. 
> + */ > + > +#include <xen/config.h> > +#include "assembler.h" > + > +/* > +* compare memory areas(when two memory areas' offset are different, > +* alignment handled by the hardware) > +* > +* Parameters: > +* x0 - const memory area 1 pointer > +* x1 - const memory area 2 pointer > +* x2 - the maximal compare byte length > +* Returns: > +* x0 - a compare result, maybe less than, equal to, or greater than ZERO > +*/ > + > +/* Parameters and result. */ > +src1 .req x0 > +src2 .req x1 > +limit .req x2 > +result .req x0 > + > +/* Internal variables. */ > +data1 .req x3 > +data1w .req w3 > +data2 .req x4 > +data2w .req w4 > +has_nul .req x5 > +diff .req x6 > +endloop .req x7 > +tmp1 .req x8 > +tmp2 .req x9 > +tmp3 .req x10 > +pos .req x11 > +limit_wd .req x12 > +mask .req x13 > + > +ENTRY(memcmp) > + cbz limit, .Lret0 > + eor tmp1, src1, src2 > + tst tmp1, #7 > + b.ne .Lmisaligned8 > + ands tmp1, src1, #7 > + b.ne .Lmutual_align > + sub limit_wd, limit, #1 /* limit != 0, so no underflow. */ > + lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */ > + /* > + * The input source addresses are at alignment boundary. > + * Directly compare eight bytes each time. > + */ > +.Lloop_aligned: > + ldr data1, [src1], #8 > + ldr data2, [src2], #8 > +.Lstart_realigned: > + subs limit_wd, limit_wd, #1 > + eor diff, data1, data2 /* Non-zero if differences found. */ > + csinv endloop, diff, xzr, cs /* Last Dword or differences. */ > + cbz endloop, .Lloop_aligned > + > + /* Not reached the limit, must have found a diff. */ > + tbz limit_wd, #63, .Lnot_limit > + > + /* Limit % 8 == 0 => the diff is in the last 8 bytes. */ > + ands limit, limit, #7 > + b.eq .Lnot_limit > + /* > + * The remained bytes less than 8. It is needed to extract valid data > + * from last eight bytes of the intended memory range. > + */ > + lsl limit, limit, #3 /* bytes-> bits. */ > + mov mask, #~0 > +CPU_BE( lsr mask, mask, limit ) > +CPU_LE( lsl mask, mask, limit ) > + bic data1, data1, mask > + bic data2, data2, mask > + > + orr diff, diff, mask > + b .Lnot_limit > + > +.Lmutual_align: > + /* > + * Sources are mutually aligned, but are not currently at an > + * alignment boundary. Round down the addresses and then mask off > + * the bytes that precede the start point. > + */ > + bic src1, src1, #7 > + bic src2, src2, #7 > + ldr data1, [src1], #8 > + ldr data2, [src2], #8 > + /* > + * We can not add limit with alignment offset(tmp1) here. Since the > + * addition probably make the limit overflown. > + */ > + sub limit_wd, limit, #1/*limit != 0, so no underflow.*/ > + and tmp3, limit_wd, #7 > + lsr limit_wd, limit_wd, #3 > + add tmp3, tmp3, tmp1 > + add limit_wd, limit_wd, tmp3, lsr #3 > + add limit, limit, tmp1/* Adjust the limit for the extra. */ > + > + lsl tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/ > + neg tmp1, tmp1/* Bits to alignment -64. */ > + mov tmp2, #~0 > + /*mask off the non-intended bytes before the start address.*/ > +CPU_BE( lsl tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/ > + /* Little-endian. Early bytes are at LSB. 
*/ > +CPU_LE( lsr tmp2, tmp2, tmp1 ) > + > + orr data1, data1, tmp2 > + orr data2, data2, tmp2 > + b .Lstart_realigned > + > + /*src1 and src2 have different alignment offset.*/ > +.Lmisaligned8: > + cmp limit, #8 > + b.lo .Ltiny8proc /*limit < 8: compare byte by byte*/ > + > + and tmp1, src1, #7 > + neg tmp1, tmp1 > + add tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/ > + and tmp2, src2, #7 > + neg tmp2, tmp2 > + add tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/ > + subs tmp3, tmp1, tmp2 > + csel pos, tmp1, tmp2, hi /*Choose the maximum.*/ > + > + sub limit, limit, pos > + /*compare the proceeding bytes in the first 8 byte segment.*/ > +.Ltinycmp: > + ldrb data1w, [src1], #1 > + ldrb data2w, [src2], #1 > + subs pos, pos, #1 > + ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */ > + b.eq .Ltinycmp > + cbnz pos, 1f /*diff occurred before the last byte.*/ > + cmp data1w, data2w > + b.eq .Lstart_align > +1: > + sub result, data1, data2 > + ret > + > +.Lstart_align: > + lsr limit_wd, limit, #3 > + cbz limit_wd, .Lremain8 > + > + ands xzr, src1, #7 > + b.eq .Lrecal_offset > + /*process more leading bytes to make src1 aligned...*/ > + add src1, src1, tmp3 /*backwards src1 to alignment boundary*/ > + add src2, src2, tmp3 > + sub limit, limit, tmp3 > + lsr limit_wd, limit, #3 > + cbz limit_wd, .Lremain8 > + /*load 8 bytes from aligned SRC1..*/ > + ldr data1, [src1], #8 > + ldr data2, [src2], #8 > + > + subs limit_wd, limit_wd, #1 > + eor diff, data1, data2 /*Non-zero if differences found.*/ > + csinv endloop, diff, xzr, ne > + cbnz endloop, .Lunequal_proc > + /*How far is the current SRC2 from the alignment boundary...*/ > + and tmp3, tmp3, #7 > + > +.Lrecal_offset:/*src1 is aligned now..*/ > + neg pos, tmp3 > +.Lloopcmp_proc: > + /* > + * Divide the eight bytes into two parts. First,backwards the src2 > + * to an alignment boundary,load eight bytes and compare from > + * the SRC2 alignment boundary. If all 8 bytes are equal,then start > + * the second part's comparison. Otherwise finish the comparison. > + * This special handle can garantee all the accesses are in the > + * thread/task space in avoid to overrange access. > + */ > + ldr data1, [src1,pos] > + ldr data2, [src2,pos] > + eor diff, data1, data2 /* Non-zero if differences found. */ > + cbnz diff, .Lnot_limit > + > + /*The second part process*/ > + ldr data1, [src1], #8 > + ldr data2, [src2], #8 > + eor diff, data1, data2 /* Non-zero if differences found. */ > + subs limit_wd, limit_wd, #1 > + csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/ > + cbz endloop, .Lloopcmp_proc > +.Lunequal_proc: > + cbz diff, .Lremain8 > + > +/*There is differnence occured in the latest comparison.*/ > +.Lnot_limit: > +/* > +* For little endian,reverse the low significant equal bits into MSB,then > +* following CLZ can find how many equal bits exist. > +*/ > +CPU_LE( rev diff, diff ) > +CPU_LE( rev data1, data1 ) > +CPU_LE( rev data2, data2 ) > + > + /* > + * The MS-non-zero bit of DIFF marks either the first bit > + * that is different, or the end of the significant data. > + * Shifting left now will bring the critical information into the > + * top bits. > + */ > + clz pos, diff > + lsl data1, data1, pos > + lsl data2, data2, pos > + /* > + * We need to zero-extend (char is unsigned) the value and then > + * perform a signed subtraction. > + */ > + lsr data1, data1, #56 > + sub result, data1, data2, lsr #56 > + ret > + > +.Lremain8: > + /* Limit % 8 == 0 =>. 
all data are equal.*/ > + ands limit, limit, #7 > + b.eq .Lret0 > + > +.Ltiny8proc: > + ldrb data1w, [src1], #1 > + ldrb data2w, [src2], #1 > + subs limit, limit, #1 > + > + ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */ > + b.eq .Ltiny8proc > + sub result, data1, data2 > + ret > +.Lret0: > + mov result, #0 > + ret > +ENDPROC(memcmp) > diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S > index c8197c6..7cc885d 100644 > --- a/xen/arch/arm/arm64/lib/memcpy.S > +++ b/xen/arch/arm/arm64/lib/memcpy.S > @@ -1,5 +1,13 @@ > /* > * Copyright (C) 2013 ARM Ltd. > + * Copyright (C) 2013 Linaro. > + * > + * This code is based on glibc cortex strings work originally authored by Linaro > + * and re-licensed under GPLv2 for the Linux kernel. The original code can > + * be found @ > + * > + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ > + * files/head:/src/aarch64/ > * > * This program is free software; you can redistribute it and/or modify > * it under the terms of the GNU General Public License version 2 as > @@ -15,6 +23,8 @@ > */ > > #include <xen/config.h> > +#include <asm/cache.h> > +#include "assembler.h" > > /* > * Copy a buffer from src to dest (alignment handled by the hardware) > @@ -26,27 +36,166 @@ > * Returns: > * x0 - dest > */ > +dstin .req x0 > +src .req x1 > +count .req x2 > +tmp1 .req x3 > +tmp1w .req w3 > +tmp2 .req x4 > +tmp2w .req w4 > +tmp3 .req x5 > +tmp3w .req w5 > +dst .req x6 > + > +A_l .req x7 > +A_h .req x8 > +B_l .req x9 > +B_h .req x10 > +C_l .req x11 > +C_h .req x12 > +D_l .req x13 > +D_h .req x14 > + > ENTRY(memcpy) > - mov x4, x0 > - subs x2, x2, #8 > - b.mi 2f > -1: ldr x3, [x1], #8 > - subs x2, x2, #8 > - str x3, [x4], #8 > - b.pl 1b > -2: adds x2, x2, #4 > - b.mi 3f > - ldr w3, [x1], #4 > - sub x2, x2, #4 > - str w3, [x4], #4 > -3: adds x2, x2, #2 > - b.mi 4f > - ldrh w3, [x1], #2 > - sub x2, x2, #2 > - strh w3, [x4], #2 > -4: adds x2, x2, #1 > - b.mi 5f > - ldrb w3, [x1] > - strb w3, [x4] > -5: ret > + mov dst, dstin > + cmp count, #16 > + /*When memory length is less than 16, the accessed are not aligned.*/ > + b.lo .Ltiny15 > + > + neg tmp2, src > + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ > + b.eq .LSrcAligned > + sub count, count, tmp2 > + /* > + * Copy the leading memory data from src to dst in an increasing > + * address order.By this way,the risk of overwritting the source > + * memory data is eliminated when the distance between src and > + * dst is less than 16. The memory accesses here are alignment. > + */ > + tbz tmp2, #0, 1f > + ldrb tmp1w, [src], #1 > + strb tmp1w, [dst], #1 > +1: > + tbz tmp2, #1, 2f > + ldrh tmp1w, [src], #2 > + strh tmp1w, [dst], #2 > +2: > + tbz tmp2, #2, 3f > + ldr tmp1w, [src], #4 > + str tmp1w, [dst], #4 > +3: > + tbz tmp2, #3, .LSrcAligned > + ldr tmp1, [src],#8 > + str tmp1, [dst],#8 > + > +.LSrcAligned: > + cmp count, #64 > + b.ge .Lcpy_over64 > + /* > + * Deal with small copies quickly by dropping straight into the > + * exit block. > + */ > +.Ltail63: > + /* > + * Copy up to 48 bytes of data. At this point we only need the > + * bottom 6 bits of count to be accurate. 
> + */ > + ands tmp1, count, #0x30 > + b.eq .Ltiny15 > + cmp tmp1w, #0x20 > + b.eq 1f > + b.lt 2f > + ldp A_l, A_h, [src], #16 > + stp A_l, A_h, [dst], #16 > +1: > + ldp A_l, A_h, [src], #16 > + stp A_l, A_h, [dst], #16 > +2: > + ldp A_l, A_h, [src], #16 > + stp A_l, A_h, [dst], #16 > +.Ltiny15: > + /* > + * Prefer to break one ldp/stp into several load/store to access > + * memory in an increasing address order,rather than to load/store 16 > + * bytes from (src-16) to (dst-16) and to backward the src to aligned > + * address,which way is used in original cortex memcpy. If keeping > + * the original memcpy process here, memmove need to satisfy the > + * precondition that src address is at least 16 bytes bigger than dst > + * address,otherwise some source data will be overwritten when memove > + * call memcpy directly. To make memmove simpler and decouple the > + * memcpy's dependency on memmove, withdrew the original process. > + */ > + tbz count, #3, 1f > + ldr tmp1, [src], #8 > + str tmp1, [dst], #8 > +1: > + tbz count, #2, 2f > + ldr tmp1w, [src], #4 > + str tmp1w, [dst], #4 > +2: > + tbz count, #1, 3f > + ldrh tmp1w, [src], #2 > + strh tmp1w, [dst], #2 > +3: > + tbz count, #0, .Lexitfunc > + ldrb tmp1w, [src] > + strb tmp1w, [dst] > + > +.Lexitfunc: > + ret > + > +.Lcpy_over64: > + subs count, count, #128 > + b.ge .Lcpy_body_large > + /* > + * Less than 128 bytes to copy, so handle 64 here and then jump > + * to the tail. > + */ > + ldp A_l, A_h, [src],#16 > + stp A_l, A_h, [dst],#16 > + ldp B_l, B_h, [src],#16 > + ldp C_l, C_h, [src],#16 > + stp B_l, B_h, [dst],#16 > + stp C_l, C_h, [dst],#16 > + ldp D_l, D_h, [src],#16 > + stp D_l, D_h, [dst],#16 > + > + tst count, #0x3f > + b.ne .Ltail63 > + ret > + > + /* > + * Critical loop. Start at a new cache line boundary. Assuming > + * 64 bytes per line this ensures the entire loop is in one line. > + */ > + .p2align L1_CACHE_SHIFT > +.Lcpy_body_large: > + /* pre-get 64 bytes data. */ > + ldp A_l, A_h, [src],#16 > + ldp B_l, B_h, [src],#16 > + ldp C_l, C_h, [src],#16 > + ldp D_l, D_h, [src],#16 > +1: > + /* > + * interlace the load of next 64 bytes data block with store of the last > + * loaded 64 bytes data. > + */ > + stp A_l, A_h, [dst],#16 > + ldp A_l, A_h, [src],#16 > + stp B_l, B_h, [dst],#16 > + ldp B_l, B_h, [src],#16 > + stp C_l, C_h, [dst],#16 > + ldp C_l, C_h, [src],#16 > + stp D_l, D_h, [dst],#16 > + ldp D_l, D_h, [src],#16 > + subs count, count, #64 > + b.ge 1b > + stp A_l, A_h, [dst],#16 > + stp B_l, B_h, [dst],#16 > + stp C_l, C_h, [dst],#16 > + stp D_l, D_h, [dst],#16 > + > + tst count, #0x3f > + b.ne .Ltail63 > + ret > ENDPROC(memcpy) > diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S > index 1bf0936..f4065b9 100644 > --- a/xen/arch/arm/arm64/lib/memmove.S > +++ b/xen/arch/arm/arm64/lib/memmove.S > @@ -1,5 +1,13 @@ > /* > * Copyright (C) 2013 ARM Ltd. > + * Copyright (C) 2013 Linaro. > + * > + * This code is based on glibc cortex strings work originally authored by Linaro > + * and re-licensed under GPLv2 for the Linux kernel. 
The original code can > + * be found @ > + * > + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ > + * files/head:/src/aarch64/ > * > * This program is free software; you can redistribute it and/or modify > * it under the terms of the GNU General Public License version 2 as > @@ -15,6 +23,8 @@ > */ > > #include <xen/config.h> > +#include <asm/cache.h> > +#include "assembler.h" > > /* > * Move a buffer from src to test (alignment handled by the hardware). > @@ -27,30 +37,161 @@ > * Returns: > * x0 - dest > */ > +dstin .req x0 > +src .req x1 > +count .req x2 > +tmp1 .req x3 > +tmp1w .req w3 > +tmp2 .req x4 > +tmp2w .req w4 > +tmp3 .req x5 > +tmp3w .req w5 > +dst .req x6 > + > +A_l .req x7 > +A_h .req x8 > +B_l .req x9 > +B_h .req x10 > +C_l .req x11 > +C_h .req x12 > +D_l .req x13 > +D_h .req x14 > + > ENTRY(memmove) > - cmp x0, x1 > - b.ls memcpy > - add x4, x0, x2 > - add x1, x1, x2 > - subs x2, x2, #8 > - b.mi 2f > -1: ldr x3, [x1, #-8]! > - subs x2, x2, #8 > - str x3, [x4, #-8]! > - b.pl 1b > -2: adds x2, x2, #4 > - b.mi 3f > - ldr w3, [x1, #-4]! > - sub x2, x2, #4 > - str w3, [x4, #-4]! > -3: adds x2, x2, #2 > - b.mi 4f > - ldrh w3, [x1, #-2]! > - sub x2, x2, #2 > - strh w3, [x4, #-2]! > -4: adds x2, x2, #1 > - b.mi 5f > - ldrb w3, [x1, #-1] > - strb w3, [x4, #-1] > -5: ret > + cmp dstin, src > + b.lo memcpy > + add tmp1, src, count > + cmp dstin, tmp1 > + b.hs memcpy /* No overlap. */ > + > + add dst, dstin, count > + add src, src, count > + cmp count, #16 > + b.lo .Ltail15 /*probably non-alignment accesses.*/ > + > + ands tmp2, src, #15 /* Bytes to reach alignment. */ > + b.eq .LSrcAligned > + sub count, count, tmp2 > + /* > + * process the aligned offset length to make the src aligned firstly. > + * those extra instructions' cost is acceptable. It also make the > + * coming accesses are based on aligned address. > + */ > + tbz tmp2, #0, 1f > + ldrb tmp1w, [src, #-1]! > + strb tmp1w, [dst, #-1]! > +1: > + tbz tmp2, #1, 2f > + ldrh tmp1w, [src, #-2]! > + strh tmp1w, [dst, #-2]! > +2: > + tbz tmp2, #2, 3f > + ldr tmp1w, [src, #-4]! > + str tmp1w, [dst, #-4]! > +3: > + tbz tmp2, #3, .LSrcAligned > + ldr tmp1, [src, #-8]! > + str tmp1, [dst, #-8]! > + > +.LSrcAligned: > + cmp count, #64 > + b.ge .Lcpy_over64 > + > + /* > + * Deal with small copies quickly by dropping straight into the > + * exit block. > + */ > +.Ltail63: > + /* > + * Copy up to 48 bytes of data. At this point we only need the > + * bottom 6 bits of count to be accurate. > + */ > + ands tmp1, count, #0x30 > + b.eq .Ltail15 > + cmp tmp1w, #0x20 > + b.eq 1f > + b.lt 2f > + ldp A_l, A_h, [src, #-16]! > + stp A_l, A_h, [dst, #-16]! > +1: > + ldp A_l, A_h, [src, #-16]! > + stp A_l, A_h, [dst, #-16]! > +2: > + ldp A_l, A_h, [src, #-16]! > + stp A_l, A_h, [dst, #-16]! > + > +.Ltail15: > + tbz count, #3, 1f > + ldr tmp1, [src, #-8]! > + str tmp1, [dst, #-8]! > +1: > + tbz count, #2, 2f > + ldr tmp1w, [src, #-4]! > + str tmp1w, [dst, #-4]! > +2: > + tbz count, #1, 3f > + ldrh tmp1w, [src, #-2]! > + strh tmp1w, [dst, #-2]! > +3: > + tbz count, #0, .Lexitfunc > + ldrb tmp1w, [src, #-1] > + strb tmp1w, [dst, #-1] > + > +.Lexitfunc: > + ret > + > +.Lcpy_over64: > + subs count, count, #128 > + b.ge .Lcpy_body_large > + /* > + * Less than 128 bytes to copy, so handle 64 bytes here and then jump > + * to the tail. 
> + */ > + ldp A_l, A_h, [src, #-16] > + stp A_l, A_h, [dst, #-16] > + ldp B_l, B_h, [src, #-32] > + ldp C_l, C_h, [src, #-48] > + stp B_l, B_h, [dst, #-32] > + stp C_l, C_h, [dst, #-48] > + ldp D_l, D_h, [src, #-64]! > + stp D_l, D_h, [dst, #-64]! > + > + tst count, #0x3f > + b.ne .Ltail63 > + ret > + > + /* > + * Critical loop. Start at a new cache line boundary. Assuming > + * 64 bytes per line this ensures the entire loop is in one line. > + */ > + .p2align L1_CACHE_SHIFT > +.Lcpy_body_large: > + /* pre-load 64 bytes data. */ > + ldp A_l, A_h, [src, #-16] > + ldp B_l, B_h, [src, #-32] > + ldp C_l, C_h, [src, #-48] > + ldp D_l, D_h, [src, #-64]! > +1: > + /* > + * interlace the load of next 64 bytes data block with store of the last > + * loaded 64 bytes data. > + */ > + stp A_l, A_h, [dst, #-16] > + ldp A_l, A_h, [src, #-16] > + stp B_l, B_h, [dst, #-32] > + ldp B_l, B_h, [src, #-32] > + stp C_l, C_h, [dst, #-48] > + ldp C_l, C_h, [src, #-48] > + stp D_l, D_h, [dst, #-64]! > + ldp D_l, D_h, [src, #-64]! > + subs count, count, #64 > + b.ge 1b > + stp A_l, A_h, [dst, #-16] > + stp B_l, B_h, [dst, #-32] > + stp C_l, C_h, [dst, #-48] > + stp D_l, D_h, [dst, #-64]! > + > + tst count, #0x3f > + b.ne .Ltail63 > + ret > ENDPROC(memmove) > diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S > index 25a4fb6..4ee714d 100644 > --- a/xen/arch/arm/arm64/lib/memset.S > +++ b/xen/arch/arm/arm64/lib/memset.S > @@ -1,5 +1,13 @@ > /* > * Copyright (C) 2013 ARM Ltd. > + * Copyright (C) 2013 Linaro. > + * > + * This code is based on glibc cortex strings work originally authored by Linaro > + * and re-licensed under GPLv2 for the Linux kernel. The original code can > + * be found @ > + * > + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ > + * files/head:/src/aarch64/ > * > * This program is free software; you can redistribute it and/or modify > * it under the terms of the GNU General Public License version 2 as > @@ -15,6 +23,8 @@ > */ > > #include <xen/config.h> > +#include <asm/cache.h> > +#include "assembler.h" > > /* > * Fill in the buffer with character c (alignment handled by the hardware) > @@ -26,27 +36,181 @@ > * Returns: > * x0 - buf > */ > + > +dstin .req x0 > +val .req w1 > +count .req x2 > +tmp1 .req x3 > +tmp1w .req w3 > +tmp2 .req x4 > +tmp2w .req w4 > +zva_len_x .req x5 > +zva_len .req w5 > +zva_bits_x .req x6 > + > +A_l .req x7 > +A_lw .req w7 > +dst .req x8 > +tmp3w .req w9 > +tmp3 .req x9 > + > ENTRY(memset) > - mov x4, x0 > - and w1, w1, #0xff > - orr w1, w1, w1, lsl #8 > - orr w1, w1, w1, lsl #16 > - orr x1, x1, x1, lsl #32 > - subs x2, x2, #8 > - b.mi 2f > -1: str x1, [x4], #8 > - subs x2, x2, #8 > - b.pl 1b > -2: adds x2, x2, #4 > - b.mi 3f > - sub x2, x2, #4 > - str w1, [x4], #4 > -3: adds x2, x2, #2 > - b.mi 4f > - sub x2, x2, #2 > - strh w1, [x4], #2 > -4: adds x2, x2, #1 > - b.mi 5f > - strb w1, [x4] > -5: ret > + mov dst, dstin /* Preserve return value. 
*/ > + and A_lw, val, #255 > + orr A_lw, A_lw, A_lw, lsl #8 > + orr A_lw, A_lw, A_lw, lsl #16 > + orr A_l, A_l, A_l, lsl #32 > + > + cmp count, #15 > + b.hi .Lover16_proc > + /*All store maybe are non-aligned..*/ > + tbz count, #3, 1f > + str A_l, [dst], #8 > +1: > + tbz count, #2, 2f > + str A_lw, [dst], #4 > +2: > + tbz count, #1, 3f > + strh A_lw, [dst], #2 > +3: > + tbz count, #0, 4f > + strb A_lw, [dst] > +4: > + ret > + > +.Lover16_proc: > + /*Whether the start address is aligned with 16.*/ > + neg tmp2, dst > + ands tmp2, tmp2, #15 > + b.eq .Laligned > +/* > +* The count is not less than 16, we can use stp to store the start 16 bytes, > +* then adjust the dst aligned with 16.This process will make the current > +* memory address at alignment boundary. > +*/ > + stp A_l, A_l, [dst] /*non-aligned store..*/ > + /*make the dst aligned..*/ > + sub count, count, tmp2 > + add dst, dst, tmp2 > + > +.Laligned: > + cbz A_l, .Lzero_mem > + > +.Ltail_maybe_long: > + cmp count, #64 > + b.ge .Lnot_short > +.Ltail63: > + ands tmp1, count, #0x30 > + b.eq 3f > + cmp tmp1w, #0x20 > + b.eq 1f > + b.lt 2f > + stp A_l, A_l, [dst], #16 > +1: > + stp A_l, A_l, [dst], #16 > +2: > + stp A_l, A_l, [dst], #16 > +/* > +* The last store length is less than 16,use stp to write last 16 bytes. > +* It will lead some bytes written twice and the access is non-aligned. > +*/ > +3: > + ands count, count, #15 > + cbz count, 4f > + add dst, dst, count > + stp A_l, A_l, [dst, #-16] /* Repeat some/all of last store. */ > +4: > + ret > + > + /* > + * Critical loop. Start at a new cache line boundary. Assuming > + * 64 bytes per line, this ensures the entire loop is in one line. > + */ > + .p2align L1_CACHE_SHIFT > +.Lnot_short: > + sub dst, dst, #16/* Pre-bias. */ > + sub count, count, #64 > +1: > + stp A_l, A_l, [dst, #16] > + stp A_l, A_l, [dst, #32] > + stp A_l, A_l, [dst, #48] > + stp A_l, A_l, [dst, #64]! > + subs count, count, #64 > + b.ge 1b > + tst count, #0x3f > + add dst, dst, #16 > + b.ne .Ltail63 > +.Lexitfunc: > + ret > + > + /* > + * For zeroing memory, check to see if we can use the ZVA feature to > + * zero entire 'cache' lines. > + */ > +.Lzero_mem: > + cmp count, #63 > + b.le .Ltail63 > + /* > + * For zeroing small amounts of memory, it's not worth setting up > + * the line-clear code. > + */ > + cmp count, #128 > + b.lt .Lnot_short /*count is at least 128 bytes*/ > + > + mrs tmp1, dczid_el0 > + tbnz tmp1, #4, .Lnot_short > + mov tmp3w, #4 > + and zva_len, tmp1w, #15 /* Safety: other bits reserved. */ > + lsl zva_len, tmp3w, zva_len > + > + ands tmp3w, zva_len, #63 > + /* > + * ensure the zva_len is not less than 64. > + * It is not meaningful to use ZVA if the block size is less than 64. > + */ > + b.ne .Lnot_short > +.Lzero_by_line: > + /* > + * Compute how far we need to go to become suitably aligned. We're > + * already at quad-word alignment. > + */ > + cmp count, zva_len_x > + b.lt .Lnot_short /* Not enough to reach alignment. */ > + sub zva_bits_x, zva_len_x, #1 > + neg tmp2, dst > + ands tmp2, tmp2, zva_bits_x > + b.eq 2f /* Already aligned. */ > + /* Not aligned, check that there's enough to copy after alignment.*/ > + sub tmp1, count, tmp2 > + /* > + * grantee the remain length to be ZVA is bigger than 64, > + * avoid to make the 2f's process over mem range.*/ > + cmp tmp1, #64 > + ccmp tmp1, zva_len_x, #8, ge /* NZCV=0b1000 */ > + b.lt .Lnot_short > + /* > + * We know that there's at least 64 bytes to zero and that it's safe > + * to overrun by 64 bytes. 
> + */ > + mov count, tmp1 > +1: > + stp A_l, A_l, [dst] > + stp A_l, A_l, [dst, #16] > + stp A_l, A_l, [dst, #32] > + subs tmp2, tmp2, #64 > + stp A_l, A_l, [dst, #48] > + add dst, dst, #64 > + b.ge 1b > + /* We've overrun a bit, so adjust dst downwards.*/ > + add dst, dst, tmp2 > +2: > + sub count, count, zva_len_x > +3: > + dc zva, dst > + add dst, dst, zva_len_x > + subs count, count, zva_len_x > + b.ge 3b > + ands count, count, zva_bits_x > + b.ne .Ltail_maybe_long > + ret > ENDPROC(memset) > diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h > index 3f4e7a1..9a511f2 100644 > --- a/xen/include/asm-arm/arm32/cmpxchg.h > +++ b/xen/include/asm-arm/arm32/cmpxchg.h > @@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > return ret; > } > > +#define xchg(ptr,x) \ > + ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr)))) > + > /* > * Atomic compare and exchange. Compare OLD with MEM, if identical, > * store NEW in MEM. Return the initial value in MEM. Success is > diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h > index b5d50f2..b49219e 100644 > --- a/xen/include/asm-arm/arm64/atomic.h > +++ b/xen/include/asm-arm/arm64/atomic.h > @@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u) > > #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0) > > -#define smp_mb__before_atomic_dec() smp_mb() > -#define smp_mb__after_atomic_dec() smp_mb() > -#define smp_mb__before_atomic_inc() smp_mb() > -#define smp_mb__after_atomic_inc() smp_mb() > - > #endif > /* > * Local variables: > diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h > index 4e930ce..ae42b2f 100644 > --- a/xen/include/asm-arm/arm64/cmpxchg.h > +++ b/xen/include/asm-arm/arm64/cmpxchg.h > @@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size > } > > #define xchg(ptr,x) \ > - ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr)))) > +({ \ > + __typeof__(*(ptr)) __ret; \ > + __ret = (__typeof__(*(ptr))) \ > + __xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \ > + __ret; \ > +}) > > extern void __bad_cmpxchg(volatile void *ptr, int size); > > @@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old, > return ret; > } > > -#define cmpxchg(ptr,o,n) \ > - ((__typeof__(*(ptr)))__cmpxchg_mb((ptr), \ > - (unsigned long)(o), \ > - (unsigned long)(n), \ > - sizeof(*(ptr)))) > - > -#define cmpxchg_local(ptr,o,n) \ > - ((__typeof__(*(ptr)))__cmpxchg((ptr), \ > - (unsigned long)(o), \ > - (unsigned long)(n), \ > - sizeof(*(ptr)))) > +#define cmpxchg(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) __ret; \ > + __ret = (__typeof__(*(ptr))) \ > + __cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \ > + sizeof(*(ptr))); \ > + __ret; \ > +}) > + > +#define cmpxchg_local(ptr, o, n) \ > +({ \ > + __typeof__(*(ptr)) __ret; \ > + __ret = (__typeof__(*(ptr))) \ > + __cmpxchg((ptr), (unsigned long)(o), \ > + (unsigned long)(n), sizeof(*(ptr))); \ > + __ret; \ > +}) > > #endif > /* > diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h > index 3242762..dfad1fe 100644 > --- a/xen/include/asm-arm/string.h > +++ b/xen/include/asm-arm/string.h > @@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c); > #define __HAVE_ARCH_MEMCPY > extern void * memcpy(void *, const void *, __kernel_size_t); > > +#if defined(CONFIG_ARM_64) > 
+#define __HAVE_ARCH_MEMCMP > +extern int memcmp(const void *, const void *, __kernel_size_t); > +#endif > + > /* Some versions of gcc don't have this builtin. It's non-critical anyway. */ > #define __HAVE_ARCH_MEMMOVE > extern void *memmove(void *dest, const void *src, size_t n); > diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h > index 7aaaf50..ce3d38a 100644 > --- a/xen/include/asm-arm/system.h > +++ b/xen/include/asm-arm/system.h > @@ -33,9 +33,6 @@ > > #define smp_wmb() dmb(ishst) > > -#define xchg(ptr,x) \ > - ((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr)))) > - > /* > * This is used to ensure the compiler did actually allocate the register we > * asked it for some inline assembly sequences. Apparently we can't trust >
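The xchg()/cmpxchg() changes at the end of the quoted patch come down to one C idiom: the old bare-cast macro trips -Wunused-value whenever a caller discards the result, while the statement-expression form imported from Linux commit e1dfda9 does not. Below is a minimal, self-contained sketch of the difference (illustration only, not part of the patch; the __xchg() body is a plain, non-atomic stand-in rather than the real exclusive load/store routine, and xchg_cast is a made-up name for the old-style macro).

/* Sketch: why the statement-expression xchg() avoids -Wunused-value. */
#include <stdio.h>

static unsigned long __xchg(unsigned long x, volatile void *ptr, int size)
{
    /* Stand-in body for illustration only: not atomic, unlike the real one. */
    volatile unsigned long *p = ptr;
    unsigned long old = *p;
    *p = x;
    (void)size;
    return old;
}

/* Old form: a bare cast of the result. */
#define xchg_cast(ptr, x) \
    ((__typeof__(*(ptr)))__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))))

/* New form, as imported by the patch: a GNU statement expression whose value
 * is the temporary __ret, so discarding the result no longer warns. */
#define xchg(ptr, x)                                                    \
({                                                                      \
    __typeof__(*(ptr)) __ret;                                           \
    __ret = (__typeof__(*(ptr)))                                        \
        __xchg((unsigned long)(x), (ptr), sizeof(*(ptr)));              \
    __ret;                                                              \
})

int main(void)
{
    volatile unsigned long v = 0;

    xchg_cast(&v, 1);   /* gcc -Wall: "value computed is not used" */
    xchg(&v, 42);       /* result ignored: no warning */

    unsigned long old = xchg(&v, 7);
    printf("old=%lu, v=%lu\n", old, (unsigned long)v);
    return 0;
}

Compiled with gcc -Wall, the bare-cast form used as a plain statement produces the unused-value warning that the import removes; the statement-expression form stays quiet whether or not the caller uses the result.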
On Fri, 2014-07-25 at 16:36 +0100, Julien Grall wrote:
> Hi Ian,
>
> On 07/25/2014 04:22 PM, Ian Campbell wrote:
> > The only really interesting changes here are the updates to mem* which update
> > to actually optimised versions and introduce an optimised memcmp.
>
> I didn't read the whole code as I assume it's just a copy with few
> changes from Linux.
>
> Acked-by: Julien Grall <julien.grall@linaro.org>

Thanks. Julien also acked the other two patches via IRC, so I have applied.

Ian.
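The only runtime probing in the imported memset is the DC ZVA check: it reads DCZID_EL0, gives up on zeroing by data-cache block if the DZP bit (bit 4) is set, and otherwise derives the block size as 4 << (DCZID_EL0 & 0xf) bytes, only using ZVA when that is at least 64 bytes. The following is a standalone sketch of the same probe in C; it assumes an AArch64 compiler with GNU inline assembly and is an illustration, not code taken from the patch.

/* Sketch: the DCZID_EL0 probe performed by the imported memset. */
#include <stdint.h>
#include <stdio.h>

/* DCZID_EL0 is readable from EL0: bit 4 (DZP) set means DC ZVA is
 * prohibited; bits 3:0 hold log2 of the zeroing block size in words. */
static inline uint64_t read_dczid_el0(void)
{
    uint64_t val;
    asm volatile("mrs %0, dczid_el0" : "=r" (val));
    return val;
}

int main(void)
{
    uint64_t dczid = read_dczid_el0();

    if (dczid & (1u << 4)) {
        printf("DZP set: DC ZVA prohibited, memset keeps using the STP loop\n");
        return 0;
    }

    /* Same computation as the assembly: zva_len = 4 << (dczid & 15) bytes. */
    unsigned int zva_len = 4u << (dczid & 0xf);

    printf("DC ZVA block size: %u bytes%s\n", zva_len,
           zva_len < 64 ? " (below 64, so memset would not use ZVA)" : "");
    return 0;
}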
diff --git a/xen/arch/arm/README.LinuxPrimitives b/xen/arch/arm/README.LinuxPrimitives index 6cd03ca..69eeb70 100644 --- a/xen/arch/arm/README.LinuxPrimitives +++ b/xen/arch/arm/README.LinuxPrimitives @@ -6,29 +6,26 @@ were last updated. arm64: ===================================================================== -bitops: last sync @ v3.14-rc7 (last commit: 8e86f0b) +bitops: last sync @ v3.16-rc6 (last commit: 8715466b6027) linux/arch/arm64/lib/bitops.S xen/arch/arm/arm64/lib/bitops.S linux/arch/arm64/include/asm/bitops.h xen/include/asm-arm/arm64/bitops.h --------------------------------------------------------------------- -cmpxchg: last sync @ v3.14-rc7 (last commit: 95c4189) +cmpxchg: last sync @ v3.16-rc6 (last commit: e1dfda9ced9b) linux/arch/arm64/include/asm/cmpxchg.h xen/include/asm-arm/arm64/cmpxchg.h -Skipped: - 60010e5 arm64: cmpxchg: update macros to prevent warnings - --------------------------------------------------------------------- -atomics: last sync @ v3.14-rc7 (last commit: 95c4189) +atomics: last sync @ v3.16-rc6 (last commit: 8715466b6027) linux/arch/arm64/include/asm/atomic.h xen/include/asm-arm/arm64/atomic.h --------------------------------------------------------------------- -spinlocks: last sync @ v3.14-rc7 (last commit: 95c4189) +spinlocks: last sync @ v3.16-rc6 (last commit: 95c4189689f9) linux/arch/arm64/include/asm/spinlock.h xen/include/asm-arm/arm64/spinlock.h @@ -38,30 +35,31 @@ Skipped: --------------------------------------------------------------------- -mem*: last sync @ v3.14-rc7 (last commit: 4a89922) +mem*: last sync @ v3.16-rc6 (last commit: d875c9b37240) -linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S -linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S -linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S -linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S +linux/arch/arm64/lib/memchr.S xen/arch/arm/arm64/lib/memchr.S +linux/arch/arm64/lib/memcmp.S xen/arch/arm/arm64/lib/memcmp.S +linux/arch/arm64/lib/memcpy.S xen/arch/arm/arm64/lib/memcpy.S +linux/arch/arm64/lib/memmove.S xen/arch/arm/arm64/lib/memmove.S +linux/arch/arm64/lib/memset.S xen/arch/arm/arm64/lib/memset.S -for i in memchr.S memcpy.S memmove.S memset.S ; do +for i in memchr.S memcmp.S memcpy.S memmove.S memset.S ; do diff -u linux/arch/arm64/lib/$i xen/arch/arm/arm64/lib/$i done --------------------------------------------------------------------- -str*: last sync @ v3.14-rc7 (last commit: 2b8cac8) +str*: last sync @ v3.16-rc6 (last commit: 2b8cac814cd5) -linux/arch/arm/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S -linux/arch/arm/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S +linux/arch/arm64/lib/strchr.S xen/arch/arm/arm64/lib/strchr.S +linux/arch/arm64/lib/strrchr.S xen/arch/arm/arm64/lib/strrchr.S --------------------------------------------------------------------- -{clear,copy}_page: last sync @ v3.14-rc7 (last commit: f27bb13) +{clear,copy}_page: last sync @ v3.16-rc6 (last commit: f27bb139c387) -linux/arch/arm64/lib/clear_page.S unused in Xen -linux/arch/arm64/lib/copy_page.S xen/arch/arm/arm64/lib/copy_page.S +linux/arch/arm64/lib/clear_page.S xen/arch/arm/arm64/lib/clear_page.S +linux/arch/arm64/lib/copy_page.S unused in Xen ===================================================================== arm32 diff --git a/xen/arch/arm/arm64/lib/Makefile b/xen/arch/arm/arm64/lib/Makefile index b895afa..2e7fb64 100644 --- a/xen/arch/arm/arm64/lib/Makefile +++ b/xen/arch/arm/arm64/lib/Makefile @@ -1,4 +1,4 @@ -obj-y += memcpy.o memmove.o 
memset.o memchr.o +obj-y += memcpy.o memcmp.o memmove.o memset.o memchr.o obj-y += clear_page.o obj-y += bitops.o find_next_bit.o obj-y += strchr.o strrchr.o diff --git a/xen/arch/arm/arm64/lib/assembler.h b/xen/arch/arm/arm64/lib/assembler.h new file mode 100644 index 0000000..84669d1 --- /dev/null +++ b/xen/arch/arm/arm64/lib/assembler.h @@ -0,0 +1,13 @@ +#ifndef __ASM_ASSEMBLER_H__ +#define __ASM_ASSEMBLER_H__ + +#ifndef __ASSEMBLY__ +#error "Only include this from assembly code" +#endif + +/* Only LE support so far */ +#define CPU_BE(x...) +#define CPU_LE(x...) x + +#endif /* __ASM_ASSEMBLER_H__ */ + diff --git a/xen/arch/arm/arm64/lib/memchr.S b/xen/arch/arm/arm64/lib/memchr.S index 3cc1b01..b04590c 100644 --- a/xen/arch/arm/arm64/lib/memchr.S +++ b/xen/arch/arm/arm64/lib/memchr.S @@ -18,6 +18,7 @@ */ #include <xen/config.h> +#include "assembler.h" /* * Find a character in an area of memory. diff --git a/xen/arch/arm/arm64/lib/memcmp.S b/xen/arch/arm/arm64/lib/memcmp.S new file mode 100644 index 0000000..9aad925 --- /dev/null +++ b/xen/arch/arm/arm64/lib/memcmp.S @@ -0,0 +1,258 @@ +/* + * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * and re-licensed under GPLv2 for the Linux kernel. The original code can + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ + +#include <xen/config.h> +#include "assembler.h" + +/* +* compare memory areas(when two memory areas' offset are different, +* alignment handled by the hardware) +* +* Parameters: +* x0 - const memory area 1 pointer +* x1 - const memory area 2 pointer +* x2 - the maximal compare byte length +* Returns: +* x0 - a compare result, maybe less than, equal to, or greater than ZERO +*/ + +/* Parameters and result. */ +src1 .req x0 +src2 .req x1 +limit .req x2 +result .req x0 + +/* Internal variables. */ +data1 .req x3 +data1w .req w3 +data2 .req x4 +data2w .req w4 +has_nul .req x5 +diff .req x6 +endloop .req x7 +tmp1 .req x8 +tmp2 .req x9 +tmp3 .req x10 +pos .req x11 +limit_wd .req x12 +mask .req x13 + +ENTRY(memcmp) + cbz limit, .Lret0 + eor tmp1, src1, src2 + tst tmp1, #7 + b.ne .Lmisaligned8 + ands tmp1, src1, #7 + b.ne .Lmutual_align + sub limit_wd, limit, #1 /* limit != 0, so no underflow. */ + lsr limit_wd, limit_wd, #3 /* Convert to Dwords. */ + /* + * The input source addresses are at alignment boundary. + * Directly compare eight bytes each time. + */ +.Lloop_aligned: + ldr data1, [src1], #8 + ldr data2, [src2], #8 +.Lstart_realigned: + subs limit_wd, limit_wd, #1 + eor diff, data1, data2 /* Non-zero if differences found. */ + csinv endloop, diff, xzr, cs /* Last Dword or differences. */ + cbz endloop, .Lloop_aligned + + /* Not reached the limit, must have found a diff. */ + tbz limit_wd, #63, .Lnot_limit + + /* Limit % 8 == 0 => the diff is in the last 8 bytes. 
*/ + ands limit, limit, #7 + b.eq .Lnot_limit + /* + * The remained bytes less than 8. It is needed to extract valid data + * from last eight bytes of the intended memory range. + */ + lsl limit, limit, #3 /* bytes-> bits. */ + mov mask, #~0 +CPU_BE( lsr mask, mask, limit ) +CPU_LE( lsl mask, mask, limit ) + bic data1, data1, mask + bic data2, data2, mask + + orr diff, diff, mask + b .Lnot_limit + +.Lmutual_align: + /* + * Sources are mutually aligned, but are not currently at an + * alignment boundary. Round down the addresses and then mask off + * the bytes that precede the start point. + */ + bic src1, src1, #7 + bic src2, src2, #7 + ldr data1, [src1], #8 + ldr data2, [src2], #8 + /* + * We can not add limit with alignment offset(tmp1) here. Since the + * addition probably make the limit overflown. + */ + sub limit_wd, limit, #1/*limit != 0, so no underflow.*/ + and tmp3, limit_wd, #7 + lsr limit_wd, limit_wd, #3 + add tmp3, tmp3, tmp1 + add limit_wd, limit_wd, tmp3, lsr #3 + add limit, limit, tmp1/* Adjust the limit for the extra. */ + + lsl tmp1, tmp1, #3/* Bytes beyond alignment -> bits.*/ + neg tmp1, tmp1/* Bits to alignment -64. */ + mov tmp2, #~0 + /*mask off the non-intended bytes before the start address.*/ +CPU_BE( lsl tmp2, tmp2, tmp1 )/*Big-endian.Early bytes are at MSB*/ + /* Little-endian. Early bytes are at LSB. */ +CPU_LE( lsr tmp2, tmp2, tmp1 ) + + orr data1, data1, tmp2 + orr data2, data2, tmp2 + b .Lstart_realigned + + /*src1 and src2 have different alignment offset.*/ +.Lmisaligned8: + cmp limit, #8 + b.lo .Ltiny8proc /*limit < 8: compare byte by byte*/ + + and tmp1, src1, #7 + neg tmp1, tmp1 + add tmp1, tmp1, #8/*valid length in the first 8 bytes of src1*/ + and tmp2, src2, #7 + neg tmp2, tmp2 + add tmp2, tmp2, #8/*valid length in the first 8 bytes of src2*/ + subs tmp3, tmp1, tmp2 + csel pos, tmp1, tmp2, hi /*Choose the maximum.*/ + + sub limit, limit, pos + /*compare the proceeding bytes in the first 8 byte segment.*/ +.Ltinycmp: + ldrb data1w, [src1], #1 + ldrb data2w, [src2], #1 + subs pos, pos, #1 + ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */ + b.eq .Ltinycmp + cbnz pos, 1f /*diff occurred before the last byte.*/ + cmp data1w, data2w + b.eq .Lstart_align +1: + sub result, data1, data2 + ret + +.Lstart_align: + lsr limit_wd, limit, #3 + cbz limit_wd, .Lremain8 + + ands xzr, src1, #7 + b.eq .Lrecal_offset + /*process more leading bytes to make src1 aligned...*/ + add src1, src1, tmp3 /*backwards src1 to alignment boundary*/ + add src2, src2, tmp3 + sub limit, limit, tmp3 + lsr limit_wd, limit, #3 + cbz limit_wd, .Lremain8 + /*load 8 bytes from aligned SRC1..*/ + ldr data1, [src1], #8 + ldr data2, [src2], #8 + + subs limit_wd, limit_wd, #1 + eor diff, data1, data2 /*Non-zero if differences found.*/ + csinv endloop, diff, xzr, ne + cbnz endloop, .Lunequal_proc + /*How far is the current SRC2 from the alignment boundary...*/ + and tmp3, tmp3, #7 + +.Lrecal_offset:/*src1 is aligned now..*/ + neg pos, tmp3 +.Lloopcmp_proc: + /* + * Divide the eight bytes into two parts. First,backwards the src2 + * to an alignment boundary,load eight bytes and compare from + * the SRC2 alignment boundary. If all 8 bytes are equal,then start + * the second part's comparison. Otherwise finish the comparison. + * This special handle can garantee all the accesses are in the + * thread/task space in avoid to overrange access. + */ + ldr data1, [src1,pos] + ldr data2, [src2,pos] + eor diff, data1, data2 /* Non-zero if differences found. 
*/ + cbnz diff, .Lnot_limit + + /*The second part process*/ + ldr data1, [src1], #8 + ldr data2, [src2], #8 + eor diff, data1, data2 /* Non-zero if differences found. */ + subs limit_wd, limit_wd, #1 + csinv endloop, diff, xzr, ne/*if limit_wd is 0,will finish the cmp*/ + cbz endloop, .Lloopcmp_proc +.Lunequal_proc: + cbz diff, .Lremain8 + +/*There is differnence occured in the latest comparison.*/ +.Lnot_limit: +/* +* For little endian,reverse the low significant equal bits into MSB,then +* following CLZ can find how many equal bits exist. +*/ +CPU_LE( rev diff, diff ) +CPU_LE( rev data1, data1 ) +CPU_LE( rev data2, data2 ) + + /* + * The MS-non-zero bit of DIFF marks either the first bit + * that is different, or the end of the significant data. + * Shifting left now will bring the critical information into the + * top bits. + */ + clz pos, diff + lsl data1, data1, pos + lsl data2, data2, pos + /* + * We need to zero-extend (char is unsigned) the value and then + * perform a signed subtraction. + */ + lsr data1, data1, #56 + sub result, data1, data2, lsr #56 + ret + +.Lremain8: + /* Limit % 8 == 0 =>. all data are equal.*/ + ands limit, limit, #7 + b.eq .Lret0 + +.Ltiny8proc: + ldrb data1w, [src1], #1 + ldrb data2w, [src2], #1 + subs limit, limit, #1 + + ccmp data1w, data2w, #0, ne /* NZCV = 0b0000. */ + b.eq .Ltiny8proc + sub result, data1, data2 + ret +.Lret0: + mov result, #0 + ret +ENDPROC(memcmp) diff --git a/xen/arch/arm/arm64/lib/memcpy.S b/xen/arch/arm/arm64/lib/memcpy.S index c8197c6..7cc885d 100644 --- a/xen/arch/arm/arm64/lib/memcpy.S +++ b/xen/arch/arm/arm64/lib/memcpy.S @@ -1,5 +1,13 @@ /* * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * and re-licensed under GPLv2 for the Linux kernel. The original code can + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as @@ -15,6 +23,8 @@ */ #include <xen/config.h> +#include <asm/cache.h> +#include "assembler.h" /* * Copy a buffer from src to dest (alignment handled by the hardware) @@ -26,27 +36,166 @@ * Returns: * x0 - dest */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +tmp3 .req x5 +tmp3w .req w5 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + ENTRY(memcpy) - mov x4, x0 - subs x2, x2, #8 - b.mi 2f -1: ldr x3, [x1], #8 - subs x2, x2, #8 - str x3, [x4], #8 - b.pl 1b -2: adds x2, x2, #4 - b.mi 3f - ldr w3, [x1], #4 - sub x2, x2, #4 - str w3, [x4], #4 -3: adds x2, x2, #2 - b.mi 4f - ldrh w3, [x1], #2 - sub x2, x2, #2 - strh w3, [x4], #2 -4: adds x2, x2, #1 - b.mi 5f - ldrb w3, [x1] - strb w3, [x4] -5: ret + mov dst, dstin + cmp count, #16 + /*When memory length is less than 16, the accessed are not aligned.*/ + b.lo .Ltiny15 + + neg tmp2, src + ands tmp2, tmp2, #15/* Bytes to reach alignment. */ + b.eq .LSrcAligned + sub count, count, tmp2 + /* + * Copy the leading memory data from src to dst in an increasing + * address order.By this way,the risk of overwritting the source + * memory data is eliminated when the distance between src and + * dst is less than 16. The memory accesses here are alignment. 
+ */ + tbz tmp2, #0, 1f + ldrb tmp1w, [src], #1 + strb tmp1w, [dst], #1 +1: + tbz tmp2, #1, 2f + ldrh tmp1w, [src], #2 + strh tmp1w, [dst], #2 +2: + tbz tmp2, #2, 3f + ldr tmp1w, [src], #4 + str tmp1w, [dst], #4 +3: + tbz tmp2, #3, .LSrcAligned + ldr tmp1, [src],#8 + str tmp1, [dst],#8 + +.LSrcAligned: + cmp count, #64 + b.ge .Lcpy_over64 + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltiny15 + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp A_l, A_h, [src], #16 + stp A_l, A_h, [dst], #16 +1: + ldp A_l, A_h, [src], #16 + stp A_l, A_h, [dst], #16 +2: + ldp A_l, A_h, [src], #16 + stp A_l, A_h, [dst], #16 +.Ltiny15: + /* + * Prefer to break one ldp/stp into several load/store to access + * memory in an increasing address order,rather than to load/store 16 + * bytes from (src-16) to (dst-16) and to backward the src to aligned + * address,which way is used in original cortex memcpy. If keeping + * the original memcpy process here, memmove need to satisfy the + * precondition that src address is at least 16 bytes bigger than dst + * address,otherwise some source data will be overwritten when memove + * call memcpy directly. To make memmove simpler and decouple the + * memcpy's dependency on memmove, withdrew the original process. + */ + tbz count, #3, 1f + ldr tmp1, [src], #8 + str tmp1, [dst], #8 +1: + tbz count, #2, 2f + ldr tmp1w, [src], #4 + str tmp1w, [dst], #4 +2: + tbz count, #1, 3f + ldrh tmp1w, [src], #2 + strh tmp1w, [dst], #2 +3: + tbz count, #0, .Lexitfunc + ldrb tmp1w, [src] + strb tmp1w, [dst] + +.Lexitfunc: + ret + +.Lcpy_over64: + subs count, count, #128 + b.ge .Lcpy_body_large + /* + * Less than 128 bytes to copy, so handle 64 here and then jump + * to the tail. + */ + ldp A_l, A_h, [src],#16 + stp A_l, A_h, [dst],#16 + ldp B_l, B_h, [src],#16 + ldp C_l, C_h, [src],#16 + stp B_l, B_h, [dst],#16 + stp C_l, C_h, [dst],#16 + ldp D_l, D_h, [src],#16 + stp D_l, D_h, [dst],#16 + + tst count, #0x3f + b.ne .Ltail63 + ret + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large: + /* pre-get 64 bytes data. */ + ldp A_l, A_h, [src],#16 + ldp B_l, B_h, [src],#16 + ldp C_l, C_h, [src],#16 + ldp D_l, D_h, [src],#16 +1: + /* + * interlace the load of next 64 bytes data block with store of the last + * loaded 64 bytes data. + */ + stp A_l, A_h, [dst],#16 + ldp A_l, A_h, [src],#16 + stp B_l, B_h, [dst],#16 + ldp B_l, B_h, [src],#16 + stp C_l, C_h, [dst],#16 + ldp C_l, C_h, [src],#16 + stp D_l, D_h, [dst],#16 + ldp D_l, D_h, [src],#16 + subs count, count, #64 + b.ge 1b + stp A_l, A_h, [dst],#16 + stp B_l, B_h, [dst],#16 + stp C_l, C_h, [dst],#16 + stp D_l, D_h, [dst],#16 + + tst count, #0x3f + b.ne .Ltail63 + ret ENDPROC(memcpy) diff --git a/xen/arch/arm/arm64/lib/memmove.S b/xen/arch/arm/arm64/lib/memmove.S index 1bf0936..f4065b9 100644 --- a/xen/arch/arm/arm64/lib/memmove.S +++ b/xen/arch/arm/arm64/lib/memmove.S @@ -1,5 +1,13 @@ /* * Copyright (C) 2013 ARM Ltd. + * Copyright (C) 2013 Linaro. + * + * This code is based on glibc cortex strings work originally authored by Linaro + * and re-licensed under GPLv2 for the Linux kernel. 
The original code can + * be found @ + * + * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ + * files/head:/src/aarch64/ * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as @@ -15,6 +23,8 @@ */ #include <xen/config.h> +#include <asm/cache.h> +#include "assembler.h" /* * Move a buffer from src to test (alignment handled by the hardware). @@ -27,30 +37,161 @@ * Returns: * x0 - dest */ +dstin .req x0 +src .req x1 +count .req x2 +tmp1 .req x3 +tmp1w .req w3 +tmp2 .req x4 +tmp2w .req w4 +tmp3 .req x5 +tmp3w .req w5 +dst .req x6 + +A_l .req x7 +A_h .req x8 +B_l .req x9 +B_h .req x10 +C_l .req x11 +C_h .req x12 +D_l .req x13 +D_h .req x14 + ENTRY(memmove) - cmp x0, x1 - b.ls memcpy - add x4, x0, x2 - add x1, x1, x2 - subs x2, x2, #8 - b.mi 2f -1: ldr x3, [x1, #-8]! - subs x2, x2, #8 - str x3, [x4, #-8]! - b.pl 1b -2: adds x2, x2, #4 - b.mi 3f - ldr w3, [x1, #-4]! - sub x2, x2, #4 - str w3, [x4, #-4]! -3: adds x2, x2, #2 - b.mi 4f - ldrh w3, [x1, #-2]! - sub x2, x2, #2 - strh w3, [x4, #-2]! -4: adds x2, x2, #1 - b.mi 5f - ldrb w3, [x1, #-1] - strb w3, [x4, #-1] -5: ret + cmp dstin, src + b.lo memcpy + add tmp1, src, count + cmp dstin, tmp1 + b.hs memcpy /* No overlap. */ + + add dst, dstin, count + add src, src, count + cmp count, #16 + b.lo .Ltail15 /*probably non-alignment accesses.*/ + + ands tmp2, src, #15 /* Bytes to reach alignment. */ + b.eq .LSrcAligned + sub count, count, tmp2 + /* + * process the aligned offset length to make the src aligned firstly. + * those extra instructions' cost is acceptable. It also make the + * coming accesses are based on aligned address. + */ + tbz tmp2, #0, 1f + ldrb tmp1w, [src, #-1]! + strb tmp1w, [dst, #-1]! +1: + tbz tmp2, #1, 2f + ldrh tmp1w, [src, #-2]! + strh tmp1w, [dst, #-2]! +2: + tbz tmp2, #2, 3f + ldr tmp1w, [src, #-4]! + str tmp1w, [dst, #-4]! +3: + tbz tmp2, #3, .LSrcAligned + ldr tmp1, [src, #-8]! + str tmp1, [dst, #-8]! + +.LSrcAligned: + cmp count, #64 + b.ge .Lcpy_over64 + + /* + * Deal with small copies quickly by dropping straight into the + * exit block. + */ +.Ltail63: + /* + * Copy up to 48 bytes of data. At this point we only need the + * bottom 6 bits of count to be accurate. + */ + ands tmp1, count, #0x30 + b.eq .Ltail15 + cmp tmp1w, #0x20 + b.eq 1f + b.lt 2f + ldp A_l, A_h, [src, #-16]! + stp A_l, A_h, [dst, #-16]! +1: + ldp A_l, A_h, [src, #-16]! + stp A_l, A_h, [dst, #-16]! +2: + ldp A_l, A_h, [src, #-16]! + stp A_l, A_h, [dst, #-16]! + +.Ltail15: + tbz count, #3, 1f + ldr tmp1, [src, #-8]! + str tmp1, [dst, #-8]! +1: + tbz count, #2, 2f + ldr tmp1w, [src, #-4]! + str tmp1w, [dst, #-4]! +2: + tbz count, #1, 3f + ldrh tmp1w, [src, #-2]! + strh tmp1w, [dst, #-2]! +3: + tbz count, #0, .Lexitfunc + ldrb tmp1w, [src, #-1] + strb tmp1w, [dst, #-1] + +.Lexitfunc: + ret + +.Lcpy_over64: + subs count, count, #128 + b.ge .Lcpy_body_large + /* + * Less than 128 bytes to copy, so handle 64 bytes here and then jump + * to the tail. + */ + ldp A_l, A_h, [src, #-16] + stp A_l, A_h, [dst, #-16] + ldp B_l, B_h, [src, #-32] + ldp C_l, C_h, [src, #-48] + stp B_l, B_h, [dst, #-32] + stp C_l, C_h, [dst, #-48] + ldp D_l, D_h, [src, #-64]! + stp D_l, D_h, [dst, #-64]! + + tst count, #0x3f + b.ne .Ltail63 + ret + + /* + * Critical loop. Start at a new cache line boundary. Assuming + * 64 bytes per line this ensures the entire loop is in one line. + */ + .p2align L1_CACHE_SHIFT +.Lcpy_body_large: + /* pre-load 64 bytes data. 
diff --git a/xen/arch/arm/arm64/lib/memset.S b/xen/arch/arm/arm64/lib/memset.S
index 25a4fb6..4ee714d 100644
--- a/xen/arch/arm/arm64/lib/memset.S
+++ b/xen/arch/arm/arm64/lib/memset.S
@@ -1,5 +1,13 @@
 /*
  * Copyright (C) 2013 ARM Ltd.
+ * Copyright (C) 2013 Linaro.
+ *
+ * This code is based on glibc cortex strings work originally authored by Linaro
+ * and re-licensed under GPLv2 for the Linux kernel. The original code can
+ * be found @
+ *
+ * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/
+ * files/head:/src/aarch64/
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License version 2 as
@@ -15,6 +23,8 @@
  */
 
 #include <xen/config.h>
+#include <asm/cache.h>
+#include "assembler.h"
 
 /*
  * Fill in the buffer with character c (alignment handled by the hardware)
@@ -26,27 +36,181 @@
  * Returns:
  *	x0 - buf
  */
+
+dstin	.req	x0
+val	.req	w1
+count	.req	x2
+tmp1	.req	x3
+tmp1w	.req	w3
+tmp2	.req	x4
+tmp2w	.req	w4
+zva_len_x	.req	x5
+zva_len	.req	w5
+zva_bits_x	.req	x6
+
+A_l	.req	x7
+A_lw	.req	w7
+dst	.req	x8
+tmp3w	.req	w9
+tmp3	.req	x9
+
 ENTRY(memset)
-	mov	x4, x0
-	and	w1, w1, #0xff
-	orr	w1, w1, w1, lsl #8
-	orr	w1, w1, w1, lsl #16
-	orr	x1, x1, x1, lsl #32
-	subs	x2, x2, #8
-	b.mi	2f
-1:	str	x1, [x4], #8
-	subs	x2, x2, #8
-	b.pl	1b
-2:	adds	x2, x2, #4
-	b.mi	3f
-	sub	x2, x2, #4
-	str	w1, [x4], #4
-3:	adds	x2, x2, #2
-	b.mi	4f
-	sub	x2, x2, #2
-	strh	w1, [x4], #2
-4:	adds	x2, x2, #1
-	b.mi	5f
-	strb	w1, [x4]
-5:	ret
+	mov	dst, dstin	/* Preserve return value. */
+	and	A_lw, val, #255
+	orr	A_lw, A_lw, A_lw, lsl #8
+	orr	A_lw, A_lw, A_lw, lsl #16
+	orr	A_l, A_l, A_l, lsl #32
+
+	cmp	count, #15
+	b.hi	.Lover16_proc
+	/*All store maybe are non-aligned..*/
+	tbz	count, #3, 1f
+	str	A_l, [dst], #8
+1:
+	tbz	count, #2, 2f
+	str	A_lw, [dst], #4
+2:
+	tbz	count, #1, 3f
+	strh	A_lw, [dst], #2
+3:
+	tbz	count, #0, 4f
+	strb	A_lw, [dst]
+4:
+	ret
+
+.Lover16_proc:
+	/*Whether the start address is aligned with 16.*/
+	neg	tmp2, dst
+	ands	tmp2, tmp2, #15
+	b.eq	.Laligned
+/*
+* The count is not less than 16, we can use stp to store the start 16 bytes,
+* then adjust the dst aligned with 16.This process will make the current
+* memory address at alignment boundary.
+*/
+	stp	A_l, A_l, [dst] /*non-aligned store..*/
+	/*make the dst aligned..*/
+	sub	count, count, tmp2
+	add	dst, dst, tmp2
+
+.Laligned:
+	cbz	A_l, .Lzero_mem
+
+.Ltail_maybe_long:
+	cmp	count, #64
+	b.ge	.Lnot_short
+.Ltail63:
+	ands	tmp1, count, #0x30
+	b.eq	3f
+	cmp	tmp1w, #0x20
+	b.eq	1f
+	b.lt	2f
+	stp	A_l, A_l, [dst], #16
+1:
+	stp	A_l, A_l, [dst], #16
+2:
+	stp	A_l, A_l, [dst], #16
+/*
+* The last store length is less than 16,use stp to write last 16 bytes.
+* It will lead some bytes written twice and the access is non-aligned.
+*/
+3:
+	ands	count, count, #15
+	cbz	count, 4f
+	add	dst, dst, count
+	stp	A_l, A_l, [dst, #-16]	/* Repeat some/all of last store. */
+4:
+	ret
+
+	/*
+	* Critical loop. Start at a new cache line boundary. Assuming
+	* 64 bytes per line, this ensures the entire loop is in one line.
+	*/
+	.p2align	L1_CACHE_SHIFT
+.Lnot_short:
+	sub	dst, dst, #16/* Pre-bias. */
+	sub	count, count, #64
+1:
+	stp	A_l, A_l, [dst, #16]
+	stp	A_l, A_l, [dst, #32]
+	stp	A_l, A_l, [dst, #48]
+	stp	A_l, A_l, [dst, #64]!
+	subs	count, count, #64
+	b.ge	1b
+	tst	count, #0x3f
+	add	dst, dst, #16
+	b.ne	.Ltail63
+.Lexitfunc:
+	ret
+
+	/*
+	* For zeroing memory, check to see if we can use the ZVA feature to
+	* zero entire 'cache' lines.
+	*/
+.Lzero_mem:
+	cmp	count, #63
+	b.le	.Ltail63
+	/*
+	* For zeroing small amounts of memory, it's not worth setting up
+	* the line-clear code.
+	*/
+	cmp	count, #128
+	b.lt	.Lnot_short /*count is at least 128 bytes*/
+
+	mrs	tmp1, dczid_el0
+	tbnz	tmp1, #4, .Lnot_short
+	mov	tmp3w, #4
+	and	zva_len, tmp1w, #15	/* Safety: other bits reserved. */
+	lsl	zva_len, tmp3w, zva_len
+
+	ands	tmp3w, zva_len, #63
+	/*
+	* ensure the zva_len is not less than 64.
+	* It is not meaningful to use ZVA if the block size is less than 64.
+	*/
+	b.ne	.Lnot_short
+.Lzero_by_line:
+	/*
+	* Compute how far we need to go to become suitably aligned. We're
+	* already at quad-word alignment.
+	*/
+	cmp	count, zva_len_x
+	b.lt	.Lnot_short	/* Not enough to reach alignment. */
+	sub	zva_bits_x, zva_len_x, #1
+	neg	tmp2, dst
+	ands	tmp2, tmp2, zva_bits_x
+	b.eq	2f	/* Already aligned. */
+	/* Not aligned, check that there's enough to copy after alignment.*/
+	sub	tmp1, count, tmp2
+	/*
+	* grantee the remain length to be ZVA is bigger than 64,
+	* avoid to make the 2f's process over mem range.*/
+	cmp	tmp1, #64
+	ccmp	tmp1, zva_len_x, #8, ge	/* NZCV=0b1000 */
+	b.lt	.Lnot_short
+	/*
+	* We know that there's at least 64 bytes to zero and that it's safe
+	* to overrun by 64 bytes.
+	*/
+	mov	count, tmp1
+1:
+	stp	A_l, A_l, [dst]
+	stp	A_l, A_l, [dst, #16]
+	stp	A_l, A_l, [dst, #32]
+	subs	tmp2, tmp2, #64
+	stp	A_l, A_l, [dst, #48]
+	add	dst, dst, #64
+	b.ge	1b
+	/* We've overrun a bit, so adjust dst downwards.*/
+	add	dst, dst, tmp2
+2:
+	sub	count, count, zva_len_x
+3:
+	dc	zva, dst
+	add	dst, dst, zva_len_x
+	subs	count, count, zva_len_x
+	b.ge	3b
+	ands	count, count, zva_bits_x
+	b.ne	.Ltail_maybe_long
+	ret
 ENDPROC(memset)
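The .Lzero_mem path above is driven by DCZID_EL0: bit 4 (DZP) forbids DC ZVA altogether, and bits [3:0] hold log2 of the zeroing block size in 4-byte words, which is what the mov #4 / lsl pair turns into a byte count. A hedged C sketch of that decode (the system-register read itself stays in assembly; the helper name is made up for illustration):

    /* Return the DC ZVA block size in bytes, or 0 if ZVA is prohibited.
     * memset only uses ZVA when the block is at least 64 bytes and the
     * buffer is at least 128 bytes. */
    static inline unsigned int zva_block_bytes(unsigned long dczid_el0)
    {
        if (dczid_el0 & (1UL << 4))        /* DZP: zeroing prohibited */
            return 0;
        return 4U << (dczid_el0 & 0xf);    /* 4-byte words << BS */
    }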
diff --git a/xen/include/asm-arm/arm32/cmpxchg.h b/xen/include/asm-arm/arm32/cmpxchg.h
index 3f4e7a1..9a511f2 100644
--- a/xen/include/asm-arm/arm32/cmpxchg.h
+++ b/xen/include/asm-arm/arm32/cmpxchg.h
@@ -40,6 +40,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 	return ret;
 }
 
+#define xchg(ptr,x) \
+	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
+
 /*
  * Atomic compare and exchange. Compare OLD with MEM, if identical,
  * store NEW in MEM. Return the initial value in MEM. Success is
diff --git a/xen/include/asm-arm/arm64/atomic.h b/xen/include/asm-arm/arm64/atomic.h
index b5d50f2..b49219e 100644
--- a/xen/include/asm-arm/arm64/atomic.h
+++ b/xen/include/asm-arm/arm64/atomic.h
@@ -136,11 +136,6 @@ static inline int __atomic_add_unless(atomic_t *v, int a, int u)
 
 #define atomic_add_negative(i,v)	(atomic_add_return(i, v) < 0)
 
-#define smp_mb__before_atomic_dec()	smp_mb()
-#define smp_mb__after_atomic_dec()	smp_mb()
-#define smp_mb__before_atomic_inc()	smp_mb()
-#define smp_mb__after_atomic_inc()	smp_mb()
-
 #endif
 /*
  * Local variables:
diff --git a/xen/include/asm-arm/arm64/cmpxchg.h b/xen/include/asm-arm/arm64/cmpxchg.h
index 4e930ce..ae42b2f 100644
--- a/xen/include/asm-arm/arm64/cmpxchg.h
+++ b/xen/include/asm-arm/arm64/cmpxchg.h
@@ -54,7 +54,12 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr, int size
 }
 
 #define xchg(ptr,x) \
-	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
+({ \
+	__typeof__(*(ptr)) __ret; \
+	__ret = (__typeof__(*(ptr))) \
+		__xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
+	__ret; \
+})
 
 extern void __bad_cmpxchg(volatile void *ptr, int size);
 
@@ -144,17 +149,23 @@ static inline unsigned long __cmpxchg_mb(volatile void *ptr, unsigned long old,
 	return ret;
 }
 
-#define cmpxchg(ptr,o,n) \
-	((__typeof__(*(ptr)))__cmpxchg_mb((ptr), \
-		(unsigned long)(o), \
-		(unsigned long)(n), \
-		sizeof(*(ptr))))
-
-#define cmpxchg_local(ptr,o,n) \
-	((__typeof__(*(ptr)))__cmpxchg((ptr), \
-		(unsigned long)(o), \
-		(unsigned long)(n), \
-		sizeof(*(ptr))))
+#define cmpxchg(ptr, o, n) \
+({ \
+	__typeof__(*(ptr)) __ret; \
+	__ret = (__typeof__(*(ptr))) \
+		__cmpxchg_mb((ptr), (unsigned long)(o), (unsigned long)(n), \
+			sizeof(*(ptr))); \
+	__ret; \
+})
+
+#define cmpxchg_local(ptr, o, n) \
+({ \
+	__typeof__(*(ptr)) __ret; \
+	__ret = (__typeof__(*(ptr))) \
+		__cmpxchg((ptr), (unsigned long)(o), \
+			(unsigned long)(n), sizeof(*(ptr))); \
+	__ret; \
+})
 
 #endif
 /*
diff --git a/xen/include/asm-arm/string.h b/xen/include/asm-arm/string.h
index 3242762..dfad1fe 100644
--- a/xen/include/asm-arm/string.h
+++ b/xen/include/asm-arm/string.h
@@ -17,6 +17,11 @@ extern char * strchr(const char * s, int c);
 #define __HAVE_ARCH_MEMCPY
 extern void * memcpy(void *, const void *, __kernel_size_t);
 
+#if defined(CONFIG_ARM_64)
+#define __HAVE_ARCH_MEMCMP
+extern int memcmp(const void *, const void *, __kernel_size_t);
+#endif
+
 /* Some versions of gcc don't have this builtin. It's non-critical anyway. */
 #define __HAVE_ARCH_MEMMOVE
 extern void *memmove(void *dest, const void *src, size_t n);
diff --git a/xen/include/asm-arm/system.h b/xen/include/asm-arm/system.h
index 7aaaf50..ce3d38a 100644
--- a/xen/include/asm-arm/system.h
+++ b/xen/include/asm-arm/system.h
@@ -33,9 +33,6 @@
 
 #define smp_wmb()	dmb(ishst)
 
-#define xchg(ptr,x) \
-	((__typeof__(*(ptr)))__xchg((unsigned long)(x),(ptr),sizeof(*(ptr))))
-
 /*
  * This is used to ensure the compiler did actually allocate the register we
  * asked it for some inline assembly sequences. Apparently we can't trust
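The rewritten xchg()/cmpxchg() wrappers use GCC statement expressions so that callers which deliberately discard the result no longer trigger the "value computed is not used" warning that a bare cast expression can produce. A self-contained illustration of the difference (with a stand-in for __xchg; not the actual Xen header):

    /* Stand-in so the example compiles on its own. */
    static unsigned long __fake_xchg(unsigned long x, volatile void *p, int sz)
    {
        (void)p; (void)sz;
        return x;
    }

    /* Old shape: a cast expression; ignoring its value can warn. */
    #define xchg_cast(ptr, x) \
        ((__typeof__(*(ptr)))__fake_xchg((unsigned long)(x), (ptr), sizeof(*(ptr))))

    /* New shape: assign inside a statement expression; the macro keeps
     * the pointed-to type but stays quiet when the result is unused. */
    #define xchg_stmt(ptr, x) \
    ({ \
        __typeof__(*(ptr)) __ret; \
        __ret = (__typeof__(*(ptr))) \
            __fake_xchg((unsigned long)(x), (ptr), sizeof(*(ptr))); \
        __ret; \
    })

    static int lock;
    static void release_example(void)
    {
        xchg_stmt(&lock, 0);   /* return value intentionally ignored */
    }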