ARMv7 and Thumb
See ARM/Thumb2 for details of the ongoing efforts in this area.
Slides
== Make gcc default to ARMv7 and Thumb2 on the ARM architecture ==

* Lucid moved to ARMv7 + Thumb-2 support.
* Karmic moved to ARMv6 + VFP2-D16.
* Kernel currently builds for ARMv7, but it is unknown whether it uses Thumb in the builds
** See the ARM() and THUMB() macros, controlled by CONFIG_THUMB2_KERNEL
* Packages with inline assembler will likely explode.
** VFP is available to both the ARM and Thumb sides

ARM presentation

Slide 1
* Newest revision of the ARM architecture
* For Ubuntu, targeting armv7-a

Slide 2
Floating-point architecture which accompanies ARMv7
* -mfpu=vfpv3-d16 -mfloat-abi=softfp
** What about -mfpu=vfpv3-d16-fp16, implemented by CodeSourcery's toolchain and GCC 4.5? Is this something we want?
* Superset of the VFPv2 floating-point instruction set supported by Karmic
* Has 16 double-precision registers, like VFPv2
** Cortex-A8/A9 implement 32 double-precision registers, but some ARMv7-A implementations do not
* What about -mfloat-abi=hard?
** Support present in gcc trunk
*** Is there a triplet for hard float? Would be good to have
*** We might be able to support both with multiarch
** Expected in the 4.5 release branch (not yet branched)
** Better performance and more register bandwidth
** Simplified JITs (Java, JavaScript, Flash ...)
** Would be an ABI break with the Debian armel architecture
*** Might also break compatibility with Ubuntu itself, and may require rebootstrapping the port
* some asm (slide changed before edit complete)

Slide 3
* -mthumb
** Thumb-2 is enabled with -mthumb -march=armv7-a (or greater)
** Superset of the compact Thumb instruction set
** Supported in all ARMv7 implementations
** Performance similar to ARM (slide changed before edit complete)

Slide 4
* Thumb-2 uses the new "Unified Assembly Language" (UAL) syntax (an extension of ARM syntax)
** UAL is generated automatically by GCC when building for Thumb-2
** Hand-written UAL can also be assembled directly to ARM or Thumb-2 code without modification
** .syntax unified enables the new syntax in GNU as
** Note that Thumb-2 has some restrictions on operand combinations and ranges which differ from ARM
** Examine the output of gcc -march=armv7-a -mthumb -S to see examples
* GNU as (Karmic or newer) can also process legacy ARM assembler as if it were UAL
** gcc -mthumb -Wa,-mimplicit-it=thumb
** Non-optimal, but will compile most legacy ARM inline asm into Thumb-2

Slide 5
* SWP instruction
** Simultaneous bus-locked read and write of a memory location
** Was used to implement atomic operations (mutexes etc.)
** Deprecated since ARMv6
*** Bad performance impact for complex architectures, especially multi-core
** Not supported at all in Thumb-2
* Seek and Destroy!
** Implementations in multi-core ARMv7 processors (Cortex-A9 etc.) are not guaranteed to be SMP-safe
** Where possible, port to gcc intrinsics
*** __sync_val_compare_and_swap and friends, see the gcc info pages
** In other cases, need to port to the LDREX/STREX family of instructions for processors which support them
** SWP and LDREX/STREX do not interoperate
*** e.g. you can't access the same mutex with both SWP and LDREX/STREX and expect it to work
** When patching, conditionally retain SWP support so that upstream, Debian etc. can still build for older architectures

Slide 6
* Proposal - binutils
** Default architecture for gcc and as
*** -march=armv7-a
** Default instruction set for as
*** -marm
*** The traditional default; most out-of-line assembler (and the binutils testsuite) expect this
*** For out-of-line assembler, it is usual to control the instruction set with directives within the assembler source.
** Default floating-point configuration
*** -mfpu=vfpv3-d16 -mfloat-abi=softfp
*** Not all implementations have the 32-register variant
*** Not ready for the hard-float ABI yet?

Slide 7
* Proposal - gcc
** Architecture baseline same as for binutils
*** -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp
** Default tuning for gcc
*** None proposed
*** -mtune=cortex-a8 is currently the default for -march=armv7 anyway
*** In the future, the upstream tools should give sensible "generic tuning" provided there is no explicit -mtune
*** Qualcomm has a tool to post-process binaries to reorder NEON instructions in a more efficient way; it's probably best to consider -mtune= to implement this properly, but there seems to be room for a new tool (similar to prelink in spirit) which would post-process binaries after install to tune them for a particular subarch
*** Performance implications: how long does it take to post-process binaries? Probably not long, since NEON binaries should be in /lib/vfp/neon and there aren't that many; we could do it with a trigger on that directory, though we have no idea how fast the tool is
** Default instruction set for gcc
*** -mthumb
*** Generates Thumb-2 code
** Legacy inline assembler support - most old inline asm won't compile to Thumb-2
*** Pass -mimplicit-it=thumb to as
*** Allows the assembler to guess UAL conditional annotations from traditional ARM inline assembler when compiling to Thumb-2
*** Non-optimal but "should work" for a lot of legacy inline asm.
*** Not so suitable for hand-optimised ARM assembler, due to its non-optimal nature
*** Harmless / inactive in all other cases

Slide 8
Processor-Specific Features in GCC
* For ARM implementations (Cortex-A8, A9), the NEON SIMD instruction set may be available on some platforms
** SIMD integer (8, 16, 32, 64-bit) and float (32-bit) operations
** If used, need to build "with NEON" and "without NEON" code and select the right implementation at package-install time or run-time
** Worth experimenting with, where support is available
*** -mfpu=neon -ftree-vectorize
*** -ftree-vectorize is turned on by default at -O3
*** Also need -ffast-math for NEON floating point to be used (NEON floating-point arithmetic is not fully IEEE compliant)
** Need to code carefully to get the best vectorization
*** C99 restricted pointers
*** Alignment considerations for optimal SIMD load/store performance
* Third-party processor implementations may have their own special extensions

From general questions time
* Can we implement kernel-based optimised routines (memcpy, for example), à la the VDSO page used on PowerPC and its equivalent on x86? Rationale: it may allow a single point of derivative-specific optimisation.
** Is this currently used for the gcc __sync intrinsics?
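
As a rough illustration of the Slide 5 advice to port SWP-based locking to gcc intrinsics, the sketch below builds a minimal spinlock on top of __sync_val_compare_and_swap and __sync_lock_release. The names demo_lock_t, demo_lock_init, demo_trylock, demo_lock and demo_unlock are hypothetical and are not taken from any package discussed here. How the compiler implements the builtins (inline LDREX/STREX or a call to a helper routine) depends on the GCC version and target, so treat this as a starting point, not a drop-in replacement.

```c
/* Hypothetical lock type, for illustration only. */
typedef struct {
    volatile int locked;   /* 0 = free, 1 = held */
} demo_lock_t;

void demo_lock_init(demo_lock_t *l)
{
    l->locked = 0;
}

/* Try once to take the lock: returns 1 on success, 0 if it is already held.
 * __sync_val_compare_and_swap atomically stores 1 if the current value is 0
 * and returns the value that was there before the operation. */
int demo_trylock(demo_lock_t *l)
{
    return __sync_val_compare_and_swap(&l->locked, 0, 1) == 0;
}

/* Spin until the lock is acquired. */
void demo_lock(demo_lock_t *l)
{
    while (!demo_trylock(l)) {
        /* busy-wait; real code would usually back off or yield here */
    }
}

/* Release the lock: __sync_lock_release writes 0 with release semantics. */
void demo_unlock(demo_lock_t *l)
{
    __sync_lock_release(&l->locked);
}
```

A caller would bracket its critical section with demo_lock(&l) and demo_unlock(&l). Because no SWP appears in the source, the same code builds for both -marm and -mthumb; the Slide 5 caveat still applies, so every user of a given lock has to go through the same primitive.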
Mobile/ARMv7AndThumb (last edited 2009-12-02 15:49:49 by pool-96-226-234-93)