== Summary ==

Lucid will drop support for older ARM architecture versions in favour of ARMv7 with Thumb-2 and softfp.

== Release Notes ==

TBD

== Rationale ==

Ubuntu targets devices for which the ARMv7 architecture is the lowest reasonable bound; taking advantage of the improved instruction set of such modern architectures is therefore essential, as it brings notable speed and size improvements.

== Design ==

Starting with the Lucid cycle, the toolchain will get default options optimized for our new target architecture: ARMv7. From there on there will be a continuous effort to track and fix regressions and build failures in user-space applications.

Detailed information about the changed compiler flags, together with initial porting instructions, is documented on the [[ARM/Thumb2]] wiki page.

== Implementation ==

 * see work items documented on blueprint page (linked from above)

== Notes/Comments ==

 * put your comments/notes here (don't forget a name).

== ARM presentation Slides ==

=== Make gcc default to ARMv7 and Thumb2 on the ARM architecture ===

 * Lucid moved to armv7 + thumb2 support.
 * Karmic moved to ARMv6 + VFP2-D16
 * Kernel currently builds for ARMv7, but unknown if it is using thumb in the builds
  * See ARM() and THUMB() macros, controlled by CONFIG_THUMB2_KERNEL
 * Packages with inline assembler will likely explode.
  * VFP available to ARM and Thumb sides

=== Slide 1 ===

 * Newest revision of the ARM architecture
 * For Ubuntu targeting armv7-a

=== Slide 2 ===

Floating-point architecture which accompanies ARMv7
 * -mfpu=vfpv3-d16 -mfloat-abi=softfp
  * What about -mfpu=vfpv3-d16-fp16 implemented by CodeSourcery's toolchain and GCC 4.5? Is this something we want?
 * superset of the VFPv2 floating-point instruction set supported by karmic
 * Has 16 double precision registers, like VFP2
  * cortex a8/a9 implements 32 double precision registers, but some armv7a implementations do not
 * what about -mfloat-abi=hard?
  * support present in gcc trunk
  * Is there a triplet for hard float?
    Would be good to have
   * We might be able to support both with multiarch
  * expected in 4.5 release branch (not yet branched)
  * better performance and more register bandwidth
  * simplified JITs (Java, JavaScript, Flash ...)
  * Would be an ABI break with Debian armel architecture
   * Might also break compatibility with Ubuntu itself, possibly requiring rebootstrapping the port
 * some asm (slide changed before edit complete)

=== Slide 3 ===

 * -mthumb
  * thumb-2 enabled with -mthumb -march=armv7-a (or greater)
 * superset of the compact thumb instruction set
 * supported in all armv7 implementations
 * performance similar to ARM (slide changed before edit complete)

=== Slide 4 ===

 * Thumb-2 uses the new "Unified Assembly Language" syntax (extension of ARM syntax)
  * UAL generated automatically by GCC when building for Thumb-2
  * Hand-written UAL can also be assembled directly to ARM or Thumb-2 code without modification
  * .syntax unified enables new syntax in GNU as
  * note that thumb-2 has some restrictions on operand combinations and ranges which differ from ARM
  * examine output of gcc -march=armv7-a -mthumb -S to see examples
 * GNU as (Karmic or newer) can also process legacy ARM assembler as if it were UAL
  * gcc -mthumb -Wa,-mimplicit-it=thumb
  * non-optimal but will compile most legacy ARM inline ASM into Thumb-2

=== Slide 5 ===

 * SWP instruction
  * Simultaneous bus-locked read and write of memory location
  * Was used to implement atomic operations (mutexes etc.)
  * Deprecated since ARMv6
   * Bad performance impact for complex architectures, especially multi-core
  * Not supported at all in Thumb2
 * Seek and Destroy!
  * Implementations in multi-core ARMv7 processors are not guaranteed to be SMP-safe (Cortex A9 etc.)
  * Where possible, port to gcc intrinsics
   * __sync_val_compare_and_swap and friends, see gcc info pages
  * In other cases, need to port to LDREX/STREX family of instructions for processors which support them
  * SWP and LDREX/STREX do not interoperate
   * e.g.
     can't access the same mutex with SWP and LDREX/STREX and expect it to work
  * When patching, conditionally retain SWP support so that upstream, Debian etc. can still be built for older architectures

=== Slide 6 ===

 * Proposal - binutils
  * Default architecture for gcc and as
   * -march=armv7-a
  * Default instruction set for as
   * -marm
    * The traditional default; most out-of-line assembler (and the binutils testsuite) expects this
    * For out-of-line assembler, it is usual to control the instruction set with directives within the assembler source.
  * Default floating-point configuration
   * -mfpu=vfpv3-d16 -mfloat-abi=softfp
    * Not all implementations have the 32-register variant
    * Not ready for hard-float ABI yet?

=== Slide 7 ===

 * Proposal - gcc
  * Architecture baseline same as for binutils
   * -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp
  * Default tuning for gcc
   * none proposed
   * -mtune=cortex-a8 is currently the default for -march=armv7 anyway
   * in the future, the upstream tools should give sensible "generic tuning" provided there is no explicit -mtune
   * Qualcomm has a tool to post-process binaries to reorder NEON instructions in a more efficient way; it's probably best to consider -mtune= to implement this properly, but it seems there's room for a new tool -- similar to prelink in spirit -- which would post-process binaries after install to tune them for a particular subarch
    * performance implications: how long does it take to post-process binaries?
      Probably not much since NEON binaries should be in /lib/vfp/neon and there aren't that many; we could do it with a trigger on that dir -- no idea how fast the tool is
  * Default instruction set for gcc
   * -mthumb
    * Generates Thumb-2 code
  * Legacy inline assembler support - most old inline asm won't compile to T2
   * Pass -mimplicit-it=thumb to as
    * Allows the assembler to guess UAL conditional annotations from traditional ARM inline assembler when compiling to Thumb-2
    * Non-optimal but "should work" for a lot of legacy inline asm.
    * Not so suitable for hand-optimised ARM assembler, due to non-optimal nature
    * Harmless / inactive in all other cases

=== Slide 8 ===

Processor-Specific Features in GCC
 * For ARM implementations (Cortex-A8, A9), the NEON SIMD instruction set may be available on some platforms
  * SIMD integer (8, 16, 32, 64-bit) and float (32-bit) operations
  * if used, need to build "with NEON" and "without NEON" code and select the right implementation at package-install time or run-time
 * Worth experimenting with, where support is available
  * -mfpu=neon -ftree-vectorize
   * -ftree-vectorize is turned on by default at -O3
   * Also need -ffast-math for NEON floating point to be used (NEON floating-point arithmetic is not fully IEEE compliant)
  * Need to code carefully to get the best vectorization
   * C99 restricted pointers
   * Alignment considerations for optimal SIMD load/store performance
 * Third-party processor implementations may have their own special extensions

From general questions time:
 * Can we implement kernel-based optimised routines (memcpy for example) à la the VDSO page used on PowerPC and the equivalent on x86? The rationale is that this may allow a single point of derivative-specific optimisation.
  * this is currently used for gcc __sync intrinsics?
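
As a rough illustration of the Slide 5 advice to port SWP-based locking to gcc intrinsics, here is a minimal C sketch of a spinlock and an atomic counter built on the `__sync_*` builtins. The names `demo_lock_t`, `demo_trylock`, `demo_lock`, `demo_unlock` and `demo_increment` are hypothetical and not taken from any package. A real port would keep the package's own lock API and, as noted above, conditionally retain the SWP path for older architectures.

```c
#include <assert.h>

/* Hypothetical lock type for illustration: 0 = unlocked, 1 = locked. */
typedef volatile int demo_lock_t;

/* Try to take the lock once. Returns 1 on success, 0 if it is already held.
 * __sync_val_compare_and_swap atomically does
 *   "if (*lock == 0) *lock = 1;" and returns the previous value. */
static int demo_trylock(demo_lock_t *lock)
{
    return __sync_val_compare_and_swap(lock, 0, 1) == 0;
}

/* Spin until the lock is acquired (no back-off; sketch only). */
static void demo_lock(demo_lock_t *lock)
{
    while (!demo_trylock(lock))
        ;
}

/* Release the lock. __sync_lock_release writes 0 with release semantics. */
static void demo_unlock(demo_lock_t *lock)
{
    assert(*lock == 1); /* unlocking a lock that is not held is a bug */
    __sync_lock_release(lock);
}

/* Atomically increment a counter and return the new value. */
static int demo_increment(volatile int *counter)
{
    return __sync_add_and_fetch(counter, 1);
}
```

With code written this way, the source no longer contains hand-written SWP, and the choice of the underlying atomic sequence is left to the compiler and target options (for example `gcc -march=armv7-a -mthumb`).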