ARMv7AndThumb

Revision 2 as of 2009-11-16 20:08:12

== Make gcc default to ARMv7 and Thumb2 on the ARM architecture ==
 * Lucid moved to armv7 + thumb2 support.
 * Karmic moved to ARMv6 + VFP2-D16
 
 * The kernel currently builds for ARMv7, but it is unknown whether the builds use Thumb
  * See ARM() and THUMB() macros, controlled by CONFIG_THUMB2_KERNEL
 * Packages with inline assembler will likely explode.
 ** VFP available to ARM and Thumb sides
 
  ARM presentation Slide 1
 
 * Newest revision of the ARM architecture
 * For Ubuntu targeting armv7-a
 
 Slide 2
 
 Floating-point architecture which accompanies ARMv7
   * -mfpu=vfpv3-d16 -mfloat-abi=softfp
   * What about -mfpu=vfpv3-d16-fp16, implemented by CodeSourcery's toolchain and
     GCC 4.5?  Is this something we want?
   * superset of the VFPv2 floating-point instruction set supported by karmic
   * Has 16 double precision registers, like VFP2
   * Cortex-A8/A9 implement 32 double precision registers, but some ARMv7-A implementations do not
   * what about -mfloat-abi=hard?
        * support present in gcc trunk
        * Is there a triplet for hard float?  Would be good to have
        * We might be able to support both with multiarch
        * expected in 4.5 release branch (not yet branched)
        * better performance and more register bandwidth
        * simplified JITs (Java, JavaScript, Flash ...)
        * Would be an ABI break with Debian armel architecture
        ** Might also break compatibility with Ubuntu itself, and may require re-bootstrapping the port
        * some asm 
        
        (slide changed before edit complete)
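
One rough way to see which of the float ABIs discussed above a given compiler invocation selects is to inspect its predefined macros. The following is a minimal sketch, assuming GCC's ARM predefines `__ARM_PCS_VFP` (hard-float calling convention) and `__SOFTFP__` (no floating-point instructions at all); the function name `float_abi_name` is made up for illustration.

```c
/* Hypothetical helper: names the float ABI this translation unit was
 * compiled for, based on GCC's predefined macros. */
const char *float_abi_name(void)
{
#if !defined(__arm__)
    return "not-arm";   /* built on a non-ARM host */
#elif defined(__ARM_PCS_VFP)
    return "hard";      /* -mfloat-abi=hard: FP arguments passed in VFP registers */
#elif defined(__SOFTFP__)
    return "soft";      /* -mfloat-abi=soft: no FP instructions emitted */
#else
    return "softfp";    /* VFP instructions, soft-float calling convention */
#endif
}
```

Printing the result from a small test program built with the distribution's default flags would show which configuration the toolchain defaults to.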
        
Slide 3

* -mthumb
        * Thumb-2 enabled with -mthumb -march=armv7-a (or greater)
* superset of the compact thumb instruction set
* supported in all armv7 implementations
* performance similar to ARM
        (slide changed before edit complete)

Slide 4

* Thumb-2 uses the new "Unified Assembly Language" syntax (extension of ARM syntax)
* UAL generated automatically by GCC when building for Thumb-2
* Hand-written UAL can also be assembled directly to ARM or Thumb-2 code without modification
        * .syntax unified enables new syntax in GNU as
        * note that thumb-2 has some restrictions on operand combinations and ranges which differ from ARM
* examine output of gcc -march=armv7-a -mthumb -S to see examples
* GNU as (Karmic or newer) can also process legacy ARM assembler as if it were UAL
        * gcc -mthumb -Wa,-mimplicit-it=thumb
        * non-optimal but will compile most legacy ARM inline ASM into Thumb-2
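
As an illustration of the UAL points on this slide, here is a small sketch of inline assembler written so that it should assemble in both ARM and Thumb-2 mode. It rests on the assumption that GNU as accepts an explicit `it` instruction in ARM mode (where it generates no code) as well as in Thumb-2; the function name `max_int` is hypothetical, and on non-ARM hosts a plain C fallback is used.

```c
/* Hypothetical example: returns the larger of a and b. */
int max_int(int a, int b)
{
#if defined(__arm__) && (!defined(__thumb__) || defined(__thumb2__))
    /* The explicit "it lt" makes the conditional move valid Thumb-2;
     * legacy ARM source would contain only the cmp and the movlt, which is
     * the case -mimplicit-it=thumb is meant to handle. */
    __asm__ ("cmp   %0, %1\n\t"
             "it    lt\n\t"
             "movlt %0, %1"
             : "+r" (a)
             : "r" (b)
             : "cc");
    return a;
#else
    /* Portable fallback for non-ARM hosts and Thumb-1-only builds. */
    return a > b ? a : b;
#endif
}
```

Building this once with -marm and once with -mthumb -march=armv7-a, and comparing the -S output, is one way to check how the same source maps to each instruction set.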
        
Slide 5

 * SWP instruction
  * Simultaneous bus-locked read and write of memory location
   * Was used to implement atomic operations (mutexes etc.)
  * Deprecated since ARMv6
   * Bad performance impact for complex architectures, especially multi-core
   * Not supported at all in Thumb2
  * Seek and Destroy!
  * Implementation in multi-core ARMv7 processors is not guaranteed to be SMP-safe (Cortex-A9 etc.)
  * Where possible, port to gcc intrinsics
    * __sync_val_compare_and_swap and friends, see gcc info pages
  * In other cases, need to port to LDREX/STREX family of instructions for processors which support them
  * SWP and LDREX/STREX do not interoperate
   * eg. can't access the same mutex with SWP and LDREX/STREX and expect it to work
   * When patching, conditionally retain SWP support so that upstream, Debian etc. can still be built for older architectures
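
A minimal sketch of what porting a SWP-based lock to the gcc intrinsics might look like. The type and function names (`demo_lock_t`, `demo_trylock`, `demo_unlock`, `demo_cas`) are invented for illustration; only the `__sync_*` builtins are real gcc interfaces.

```c
/* Hypothetical lock word: 0 = unlocked, 1 = locked. */
typedef volatile int demo_lock_t;

/* Try to take the lock; returns 1 on success, 0 if it was already held.
 * __sync_lock_test_and_set atomically stores 1 and returns the old value,
 * which is the job SWP was typically used for. */
int demo_trylock(demo_lock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1) == 0;
}

/* Release the lock (stores 0 with release semantics). */
void demo_unlock(demo_lock_t *lock)
{
    __sync_lock_release(lock);
}

/* Compare-and-swap: if *p equals expected, store desired.
 * Returns 1 if the swap happened, 0 otherwise. */
int demo_cas(volatile int *p, int expected, int desired)
{
    return __sync_val_compare_and_swap(p, expected, desired) == expected;
}
```

Because the compiler picks the instruction sequence for the target, the same source can be built for older architectures as well, which fits the point about keeping patches acceptable to upstream and Debian.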
   
 Slide 6
  * Proposal - binutils
  * Default architecture for gcc and as
   * -march=armv7-a
  * Default instruction set for as
   * -marm
   * The traditional default; most out-of-line assembler (and the binutils testsuite) expect this
   * For out-of-line assembler, it is usual to control the instruction set with directives within the assembler source.
  * Default floating-point configuration
   * -mfpu=vfpv3-d16 -mfloat-abi=softfp
   * Not all implementations have the 32-register variant
   * Not ready for hard-float ABI yet?
   
Slide 7
 * Proposal - gcc
 * Architecture baseline same as for binutils
  * -march=armv7-a -mfpu=vfpv3-d16 -mfloat-abi=softfp
 * Default tuning for gcc
  * none proposed
  * -mtune=cortex-a8 is currently the default for -march=armv7 anyway
  * in the future, the upstream tools should give sensible "generic tuning" provided there is no explicit -mtune
  * Qualcomm has a tool to post-process binaries to reorder NEON instructions in a more efficient way; it's probably best to consider -mtune= to implement this properly, but it seems there's room for a new tool -- similar to prelink in spirit -- which would post-process binaries after install to tune them for a particular subarch
   * performance implications: how long does it take to post-process binaries?  Probably not much since NEON binaries should be in /lib/vfp/neon and there aren't that many; we could do it with a trigger on that dir -- no idea how fast the tool is
 * Default instruction set for gcc
  * -mthumb
  * Generates Thumb-2 code
 * Legacy inline assembler support - most old inline asm won't compile to T2
  * Pass -mimplicit-it=thumb to as
  * Allows the assembler to guess UAL conditional annotations from traditional ARM inline assembler when compiling to Thumb-2
  * Non-optimal but "should work" for a lot of legacy inline asm.
  * Not so suitable for hand-optimised ARM assembler, due to non-optimal nature
  * Harmless / inactive in all other cases
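
To check whether a package is actually being compiled as Thumb-2 under the proposed defaults, the compiler's predefined macros can be queried. This is a sketch only, assuming GCC's `__thumb__` and `__thumb2__` predefines; the function name `arm_isa_name` is hypothetical.

```c
/* Hypothetical helper: names the instruction set this translation unit
 * is being compiled to. */
const char *arm_isa_name(void)
{
#if !defined(__arm__)
    return "not-arm";   /* built on a non-ARM host */
#elif defined(__thumb2__)
    return "thumb2";    /* e.g. -mthumb -march=armv7-a */
#elif defined(__thumb__)
    return "thumb1";    /* -mthumb on an architecture without Thumb-2 */
#else
    return "arm";       /* -marm */
#endif
}
```

The same macros can be used in an #if around hand-optimised ARM assembler, so that it is only built when the compiler is really in ARM mode.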
  
Slide 8
Processor-Specific Features in GCC
 * For ARM implementations (Cortex-A8, A9), the NEON SIMD instruction set may be available on some platforms
  * SIMD integer (8, 16, 32, 64-bit) and float (32-bit) operations
  * if used, need to build "with NEON", and "without NEON" code and select the right implementation at package-install time or run-time
 * Worth experimenting with, where support is available
 * -mfpu=neon -ftree-vectorize
  * -ftree-vectorize is turned on by default at -O3
  * Also need -ffast-math for NEON floating point to be used (NEON floating-point arithmetic is not fully IEEE compliant)
 * Need to code carefully to get the best vectorization
  * C99 restricted pointers
  * Alignment considerations for optimal SIMD load/store performance
 * Third-party processor implementations may have their own special extensions
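
The "code carefully" point can be illustrated with C99 restricted pointers. The sketch below uses a made-up function name, `scale_add`; the `restrict` qualifiers promise the compiler that the three arrays do not overlap, which removes one common obstacle to auto-vectorization. Whether NEON code is actually generated depends on the flags listed above and on the compiler version.

```c
#include <stddef.h>

/* Hypothetical kernel: dst[i] = a[i] * k + b[i] for i in [0, n).
 * The restrict qualifiers tell the compiler the arrays do not alias. */
void scale_add(float *restrict dst, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * k + b[i];
}
```

Compiling with -O3 -mfpu=neon -ffast-math -S on an ARM toolchain and inspecting the output is one way to see whether the loop was vectorized.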
 
 
 
 From general questions time: can we implement kernel-provided optimised routines (memcpy, for example) à la the vDSO page used on PowerPC and its equivalent on x86?  The rationale is that this may allow a single point of derivative-specific optimisation.
  - Is this currently used for the gcc __sync intrinsics?