Lucid will drop support for older ARM architecture revisions in favour of ARMv7 with Thumb-2 and the softfp floating-point ABI.
Ubuntu targets devices for which ARMv7 is the lowest reasonable architecture baseline; taking advantage of the improved instruction set of these modern processors is therefore essential, as it brings notable speed and size improvements.
Starting with the Lucid cycle, the toolchain will get default options optimized for our new target architecture: ARMv7. From there on there will be a continuous effort to track and fix regressions and build failures in user-space applications. Detailed information about the changed compiler flags, together with initial porting instructions, is documented on the ARM/Thumb2 wiki page.
- see work items documented on blueprint page (linked from above)
- put your comments/notes here (don't forget a name).
ARM presentation Slides
Make gcc default to ARMv7 and Thumb2 on the ARM architecture
- Lucid moved to armv7 + thumb2 support.
- Karmic moved to ARMv6 + VFP2-D16
- Kernel currently builds for ARMv7, but unknown if it is using thumb in the builds
- See ARM() and THUMB() macros, controlled by CONFIG_THUMB2_KERNEL
- Packages with inline assembler will likely explode.
- VFP available to both ARM and Thumb sides
- Newest revision of the ARM architecture
- For Ubuntu targeting armv7-a
- Floating-point architecture which accompanies ARMv7
- -mfpu=vfpv3-d16 -mfloat-abi=softfp
- What about -mfpu=vfpv3-d16-fp16, implemented by CodeSourcery's toolchain and GCC 4.5? Is this something we want?
- superset of the VFPv2 floating-point instruction set supported by Karmic
- Has 16 double-precision registers, like VFP2
- Cortex-A8/A9 implement 32 double-precision registers, but some ARMv7-A implementations do not
- what about -mfloat-abi=hard?
- support present in gcc trunk
- Is there a triplet for hard float? Would be good to have
- We might be able to support both with multiarch
- expected in 4.5 release branch (not yet branched)
- better performance and more register bandwidth
- Would be an ABI break with Debian armel architecture
- Might also break compatibility with Ubuntu itself, which may require rebootstrapping the port
- some asm (slide changed before edit complete)
- thumb-2 enabled with -mthumb -march=armv7-a (or greater)
- superset of the compact Thumb instruction set
- supported in all ARMv7 implementations
- performance similar to ARM
- (slide changed before edit complete)
- Thumb-2 uses the new "Unified Assembly Language" (UAL) syntax (an extension of ARM syntax)
- UAL is generated automatically by GCC when building for Thumb-2
- Hand-written UAL can also be assembled directly to ARM or Thumb-2 code without modification
- .syntax unified enables new syntax in GNU as
- note that thumb-2 has some restrictions on operand combinations and ranges which differ from ARM
- examine the output of gcc -march=armv7-a -mthumb -S to see examples
- GNU as (Karmic or newer) can also process legacy ARM assembler as if it were UAL
- gcc -mthumb -Wa,-mimplicit-it=thumb
- non-optimal but will compile most legacy ARM inline ASM into Thumb-2
- SWP instruction
- Simultaneous bus-locked read and write of a memory location
- Was used to implement atomic operations (mutexes etc.)
- Deprecated since ARMv6
- Bad performance impact for complex architectures, especially multi-core
- Not supported at all in Thumb2
- Seek and Destroy!
- Implementations in multi-core ARMv7 processors are not guaranteed to be SMP-safe (Cortex-A9 etc.)
- Where possible, port to gcc intrinsics
- __sync_val_compare_and_swap and friends, see gcc info pages
- In other cases, need to port to LDREX/STREX family of instructions for processors which support them
- SWP and LDREX/STREX do not interoperate
- eg. can't access the same mutex with SWP and LDREX/STREX and expect it to work
- When patching, conditionally retain SWP support so that upstream, Debian etc. can still build for older architectures
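As a rough illustration of the "port to gcc intrinsics" advice above, the sketch below replaces a SWP-style lock with the __sync builtins. The demo_* names and the lock type are hypothetical, invented for this example only; it is not taken from any particular package. On ARMv7 GCC is expected to expand these builtins to LDREX/STREX sequences (or to kernel helpers on older baselines), so the source carries no inline assembler at all.

```c
/* Hypothetical lock type for illustration: 0 = unlocked, 1 = locked. */
typedef volatile int demo_lock_t;

/* Try to take the lock once. Returns 1 on success, 0 if already held. */
static int demo_trylock(demo_lock_t *lock)
{
    /* Atomically: if *lock == 0, store 1. The builtin returns the old value. */
    return __sync_val_compare_and_swap(lock, 0, 1) == 0;
}

/* Spin until the lock is acquired. */
static void demo_lock(demo_lock_t *lock)
{
    while (!demo_trylock(lock))
        ;  /* busy-wait; a real implementation would back off or sleep */
}

/* Release the lock: stores 0 with release semantics. */
static void demo_unlock(demo_lock_t *lock)
{
    __sync_lock_release(lock);
}
```

A package that must still build for pre-ARMv6 targets could keep its existing SWP path behind a preprocessor conditional and select code like the above otherwise. As noted, the two mechanisms must never be mixed on the same lock.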
- Proposal - binutils
- Default architecture for gcc and as
- Default instruction set for as
- The traditional default; most out-of-line assembler (and the binutils testsuite) expect this
- For out-of-line assembler, it is usual to control the instruction set with directives within the assembler source.
- Default floating-point configuration
- -mfpu=vfpv3-d16 -mfloat-abi=softfp
- Not all implementations have the 32-register variant
- Not ready for hard-float ABI yet?
- Proposal - gcc
- Architecture baseline same as for binutils
- -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp
- Default tuning for gcc
- none proposed
- -mtune=cortex-a8 is currently the default for -march=armv7 anyway
- in the future, the upstream tools should give sensible "generic tuning" provided there is no explicit -mtune
- Qualcomm has a tool to post-process binaries to reorder NEON instructions in a more efficient way; it's probably best to consider -mtune= to implement this properly, but it seems there's room for a new tool -- similar to prelink in spirit -- which would post-process binaries after install to tune them for a particular subarch
- performance implications: how long does it take to post-process binaries? Probably not much since NEON binaries should be in /lib/vfp/neon and there aren't that many; we could do it with a trigger on that dir -- no idea how fast the tool is
- Default instruction set for gcc
- Generates Thumb-2 code
- Legacy inline assembler support - most old inline asm won't compile to T2
- Pass -mimplicit-it=thumb to as
- Allows the assembler to guess UAL conditional annotations from traditional ARM inline assembler when compiling to Thumb-2
- Non-optimal but "should work" for a lot of legacy inline asm.
- Not so suitable for hand-optimised ARM assembler, due to non-optimal nature
- Harmless / inactive in all other cases
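When checking whether a package really picked up the proposed defaults, it can help to look at the macros GCC predefines for the selected target. The sketch below is only an illustration: the demo_* function names are hypothetical, while __arm__, __thumb__, __thumb2__, __SOFTFP__ and __ARM_PCS_VFP are macros GCC defines on ARM targets. With the defaults proposed above, the expectation is "thumb2" and "softfp".

```c
/* Hypothetical helper: which instruction set was this file compiled to? */
static const char *demo_target_isa(void)
{
#if defined(__thumb2__)
    return "thumb2";      /* e.g. -march=armv7-a -mthumb */
#elif defined(__thumb__)
    return "thumb1";      /* -mthumb on a pre-Thumb-2 architecture */
#elif defined(__arm__)
    return "arm";         /* ARM instruction set (-marm) */
#else
    return "non-arm";     /* not an ARM build at all */
#endif
}

/* Hypothetical helper: which floating-point ABI is in effect? */
static const char *demo_float_abi(void)
{
#if !defined(__arm__)
    return "n/a";
#elif defined(__ARM_PCS_VFP)
    return "hard";        /* -mfloat-abi=hard */
#elif defined(__SOFTFP__)
    return "soft";        /* -mfloat-abi=soft */
#else
    return "softfp";      /* FP hardware used, soft-float calling convention */
#endif
}
```

The same information is available without compiling anything by running gcc -dM -E - < /dev/null and searching the output for these macros.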
Processor-Specific Features in GCC
- For some ARM implementations (Cortex-A8, A9), the NEON SIMD instruction set may be available, depending on the platform
- SIMD integer (8, 16, 32, 64-bit) and float (32-bit) operations
- if used, need to build "with NEON", and "without NEON" code and select the right implementation at package-install time or run-time
- Worth experimenting with, where support is available
- -mfpu=neon -ftree-vectorize
- -ftree-vectorize is turned on by default at -O3
- Also need -ffast-math for NEON floating point to be used (NEON floating-point arithmetic is not fully IEEE compliant)
- Need to code carefully to get the best vectorization
- C99 restricted pointers
- Alignment considerations for optimal SIMD load/store performance
- Third-party processor implementations may have their own special extensions
- From general questions time: can we implement kernel-based optimised routines (memcpy for example) a la the VDSO page used on PowerPC and the equivalent on x86? Rationale: this may allow a single point of derivative-specific optimisation.
- this is currently used for gcc sync intrinsics?
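To make the "code carefully to get the best vectorization" point from the NEON notes above more concrete, here is a minimal sketch of a loop written with C99 restricted pointers. The function name demo_saxpy is hypothetical. The restrict qualifiers promise the compiler that x and y do not overlap, which is one of the things the auto-vectorizer needs to know before it can turn the loop into SIMD loads and stores. Whether NEON code is actually emitted depends on the flags discussed above (for example -O3 -mfpu=neon -ffast-math) and on the compiler version.

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
 * 'restrict' tells the compiler that x and y never alias, so iterations
 * are independent and the loop is a candidate for vectorization. */
void demo_saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Adding -ftree-vectorizer-verbose=1 on GCC releases of this era is one way to confirm whether the loop was vectorized. Keeping the arrays suitably aligned addresses the load/store performance point.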