Thumb2PortingHowto

UNDER CONSTRUCTION

When you see some assembler in a source package, there are some things which you need to consider when porting for Thumb-2 compatibility.

[FIXME - we really need examples to illustrate this page! When some packages have already been looked at, links should be added here]

How to Port Packages

[WRITE ME] [find the affected bits of code] [work out the implications and fix, depending on the issue types]

Key Thumb-2 compatibility Issues

Here's a quick breakdown of the key issues which may need attention.

Procedure calls and returns

When using Thumb-2, the system will generally contain a mixture of ARM and Thumb-2 functions (depending on how libraries and binaries, and their component objects and functions, were assembled).

The processor does not automatically know which instruction set is used for the code being executed after a branch, procedure call or procedure return --- instead, it must be told which instruction set to use at the time of the branch or return.

Getting this right is known as "interworking".

For C code, the compiler and linker take care of this and it will "just work". In assembler, however, every branch, call and return must be done in an interworking-safe way; otherwise the processor will try to interpret the code using the wrong instruction set and sooner or later crash the running process (it certainly won't be doing what the programmer intended, much as if you had branched into some arbitrary data).

How it works

The target instruction set state is determined in different ways depending on the type of branch.

Note that the optional condition code at the end of each instruction mnemonic is omitted.

  • b <label> no switch

    • This is (usually) the right way to do a non-returning jump to a label or function.
    • if <label> is an external symbol defined elsewhere, the linker will magically patch it up to switch appropriately.

    • Caution is required if ARM and Thumb are mixed in a single source file (rare), since there is no automatic instruction set switch for local symbols (in this unlikely case you may need to use bx instead). The assembler may or may not magically introduce a veneer/trampoline depending on whether or not it knows that the destination is in a different instruction set and is definitely a code symbol (à la .type <symbol>, %function or .thumb_func). Note that just because a symbol appears in a code section it is not assumed to be a code symbol unless specifically tagged in one of the aforementioned ways.

  • bl <label> no switch

    • This is (usually) the right way to do a procedure call to a label or function.
    • if <label> is an external symbol defined elsewhere, the linker will magically patch it up to switch appropriately --- but this does not guarantee a successful return unless the called function does a correct interworking return.

    • The link register (LR or r14) is automatically set to the return address, just following the call. The bottom bit of LR is automatically set to 0 (ARM) or 1 (Thumb) to indicate which instruction set to switch to when returning. (This is not automatically done if you use an instruction like mov lr, pc to determine the return address, which can lead to problems when the called function returns.)

    • Caution is required if ARM and Thumb are mixed in a single source file (rare). As for b <label>, the assembler may or may not magically introduce a veneer/trampoline depending on whether or not it knows that the destination is in a different instruction set and is definitely a code symbol (à la .type <symbol>, %function or .thumb_func).

  • bx <register> (register is usually lr): switches depending on bottom bit of <register>

    • This is one of the two right ways to do a procedure return.
    • Assumes the LR value (or whatever <register> is used) has the bottom bit set correctly to indicate ARM or Thumb (which will be the case if it comes from the LR value set by a correct procedure call).

    • It is not supported on ARMv4, so for Debian compatibility a workaround is needed. Note that bx lr is better for performance on newer processors, since mov pc, lr may harm branch prediction performance.

#if defined (__ARM_ARCH_4T__) || defined (__ARM_ARCH_4__)
        "mov    pc, lr"
#else
        "bx     lr"
#endif
  • blx <register> (register should usually not be lr): saves the return address in lr and calls a function at the address held in the specified <register>. Switches instruction set depending on bottom bit of <register>.

    • This is the preferred way to do a procedure call to a computed or variable address.
    • The called function must still do a correct return.
    • This instruction is not supported on ARMv4(T), so code intended to build and work on Debian must use an alternative workaround. Because Debian does not use Thumb code, the following snippet is usually sensible (see "Computed destinations and returns" for an explanation of why this isn't safe for Thumb code, though):

#if defined (__ARM_ARCH_4T__) || defined (__ARM_ARCH_4__)
        "mov    lr, pc\n\t"
        "mov    pc, <register>"
#else
        "blx    <register>"
#endif
  • ldr pc, [...] or pop {..., pc} or ldmfd sp!, {..., pc}: switches depending on the bottom bit of the value loaded for PC.

    • The other right way to do a procedure return.
    • Assumes you saved the LR value as part of the function prologue.
    • Assumes the LR value has the bottom bit set correctly to indicate ARM or Thumb (which will be the case if the procedure call was done in the right way).
    • Debian-compatible
    • Additional registers can be restored from the stack as part of the return, in the usual way.
  • mov pc, <register>: no switch unless executed from ARM code AND the processor is >= ARMv7

    • Often the right way to implement inline jump tables (see "PC arithmetic and position-independent addressing")
    • Debian-compatible way to do a procedure return or computed/variable branch (ARM only)
    • Not recommended for procedure returns on ARMv7.
    • Generally, it's best to avoid relying on the interworking behaviour of this instruction, since newer ARM processors are not all optimised to do efficient branch-prediction in this case, so something like this is preferable:

#if defined (__ARM_ARCH_4T__) || defined (__ARM_ARCH_4__)
        "mov    pc, lr"
#else
        "bx     lr"
#endif
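
The switching rules listed above can be modelled in a few lines of C. This is only an illustrative sketch, not processor documentation: the names (cpu_state, model_bx, model_bl_return_address, model_mov_lr_pc_arm) are made up for this page, and the model ignores condition codes and alignment corner cases.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the interworking rules; all names are illustrative. */
enum isa { ISA_ARM = 0, ISA_THUMB = 1 };

struct cpu_state {
    uint32_t pc;   /* address of the next instruction to execute */
    enum isa isa;  /* instruction set used to decode it */
};

/* bx <register>: bit 0 selects the instruction set, the rest is the address. */
struct cpu_state model_bx(uint32_t reg_value)
{
    struct cpu_state next;
    next.isa = (reg_value & 1u) ? ISA_THUMB : ISA_ARM;
    next.pc = reg_value & ~1u;
    return next;
}

/* bl <label>: LR = address following the call, with bit 0 = caller's state. */
uint32_t model_bl_return_address(uint32_t bl_address, enum isa caller)
{
    /* BL occupies 4 bytes in both ARM and Thumb. */
    uint32_t next_instruction = bl_address + 4u;
    assert((next_instruction & 1u) == 0u);
    return next_instruction | (caller == ISA_THUMB ? 1u : 0u);
}

/* mov lr, pc executed in ARM state: pc reads as the address of the mov
   itself + 8, and bit 0 is never set. */
uint32_t model_mov_lr_pc_arm(uint32_t mov_address)
{
    return mov_address + 8u;
}
```

In this model, a return address produced by model_bl_return_address from Thumb code has bit 0 set, so feeding it to model_bx comes back in Thumb state. A hand-computed return address without bit 0 would come back in ARM state regardless of what the caller was.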

Computed destinations and returns

Whenever a destination or return address is variable or otherwise determined at run-time, you need to be careful to set the "thumb bit" (bit 0) in the address correctly and/or do the correct type of branch, to make sure that the call (and return, if applicable) switch instruction set appropriately.

labels or functions:

  • If you reference an external label or function defined in another object, the linker will magically give you an address with the "Thumb bit" (bit 0) set appropriately. This means you can branch to it safely with bx or blx, or store it in memory and load it into PC later, pass it to other functions as a callback, etc.

  • If you reference a symbol internal to the object, life is more "interesting":
    • If the symbol is a C function the Thumb bit in the address you get will be set appropriately.
    • If the symbol is tagged using an assembler directive .type <symbol>, %function or .thumb_func, the Thumb bit in the address you get will be set appropriately. (GCC always does this for C functions.)

    • Otherwise, the Thumb bit will not be set appropriately.

    • When referencing GNU assembler local labels (0b, 1f etc.) the Thumb bit will not be set appropriately.

".":

  • The current assembly location symbol in the GNU assembler (.) never has the Thumb bit set.

  • This behaviour is usually useful but sometimes unexpected. As a consequence, code like ldr r0, =. ; bx r0 is not an infinite loop in Thumb --- instead, it would re-execute the instructions as ARM and will probably crash the process.

  • Note that because b and bl do not switch instruction state, subs r0, r0, #1 ; bne . - 2 will work as a simple delay loop in Thumb (but you should never write it this way; see "PC and . arithmetic and position-independent addressing").

In cases where the Thumb bit is not set appropriately, it will simply be left as 0. For this reason, the distinction is not important when executing in ARM (where no instruction set change is implied by the Thumb bit), but is important in Thumb (where there may be an unintentional switch to ARM if you don't take corrective action).

The general rule is as follows: if the address will be passed to any other function or object (as a return address, method address, callback etc.) then you must ensure that the Thumb bit gets set if the code is assembler in Thumb.

However, if the little bit of code you're hacking knows that the Thumb bit is never set in the address, it may be safe not to set it so long as you bear this in mind. This can make sense for inline jump tables etc. -- see "Jump tables"
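
For C code that builds code addresses by hand (for example around inline assembler), the general rule can be sketched as follows. __thumb__ is predefined by GCC when a file is compiled as Thumb; the THUMB_BIT macro and the helper name code_address_for_bx are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* GCC predefines __thumb__ when the translation unit is built as Thumb. */
#if defined(__thumb__)
#define THUMB_BIT 1u
#else
#define THUMB_BIT 0u
#endif

/* Hypothetical helper: turn the raw address of a label in *this* object
   (which has bit 0 clear) into a value that is safe to hand to bx, blx
   or a callback slot. */
uintptr_t code_address_for_bx(uintptr_t raw_label_address)
{
    assert((raw_label_address & 1u) == 0u);
    return raw_label_address | THUMB_BIT;
}
```

Addresses of C functions and of properly tagged assembler functions do not need this treatment, since the toolchain already sets the bit for them.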

PC and "." arithmetic and position-independent addressing

Introduction

ARM instructions cannot pack an arbitrary address into any instruction as an operand. For this reason, and if position-independence is required, a common coding technique is to address some data using offsets relative to the current instruction location.

On ARM, you can (sort of) read PC as if it were an ordinary register, and this can be used to determine the currently executing address, though some adjustment is needed. This allows various position-independent coding tricks; however, the value you observe when "reading" the PC differs between ARM and Thumb, and is not the same in all circumstances:

|| Instruction Set State || Usage Context || Instruction Example || Observed Value of PC ||
|| ARM || most instructions || || address of instruction + 8 ||
|| ARM || instructions which store PC to memory || STR pc, [sp, #4] || address of instruction + 8 or 12 (implementation-dependent, but consistent for a given device; avoid relying on this behaviour) ||
|| Thumb || source operand in ADD, SUB or MOV || ADD r1, pc, #5 || address of instruction + 8 ||
|| Thumb || source operand in ADD, or base register in load or store || LDR r2, [pc, #16] || (address of instruction & ~3) + 4 ||
|| Thumb || source operand in SUB or MOV || SUB r2, #8 || address of instruction + 4 ||
|| Thumb || most other uses || || undefined ||

"address of instruction" is equivalent to the value of the magic assembler location counter symbol "." if it appears in the same instruction, i.e. the virtual address of the first byte of the currently executing instruction (which in ARM is always a multiple of 4, and in Thumb is always a multiple of 2).

From the above, you can probably guess that it is hard to write code which will work in both ARM and Thumb if you reference the PC directly. Unfortunately, it is quite common to see such explicit references in hand-written traditional ARM assembler.

Some examples follow, along with explanation of how to make them Thumb-2 compatible.

Unless specified otherwise, the suggested workarounds are compatible with all assembler versions, even for assemblers which may not support Thumb-2.

Typical uses - loading a literal from the text section

Traditional way

        LDR     r0, [pc, #(data - . - 8)]
...
data:
        .long 0xdeadbeef

What happens in ARM

pc == . + 8, so r0 is loaded from the address (. + 8) + data - . - 8, = data

What happens in Thumb

pc == (. & ~3) + 4, so r0 is loaded from the address ((. & ~3) + 4) + data - . - 8, = data - 4 - (. % 4)

...which is unlikely to be what the programmer intended...
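
The arithmetic can be checked with a small C model. This is a sketch under the PC rules stated above; the function names are invented for the illustration and the addresses used with it are arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Value seen when an ARM instruction at address `dot` reads pc. */
uint32_t arm_pc(uint32_t dot)
{
    return dot + 8u;
}

/* Value used as the base of a pc-relative load in Thumb. */
uint32_t thumb_load_base(uint32_t dot)
{
    return (dot & ~3u) + 4u;
}

/* Address accessed by LDR r0, [pc, #(data - . - 8)] for a given pc value. */
uint32_t traditional_load_address(uint32_t pc, uint32_t data, uint32_t dot)
{
    assert(data >= dot + 8u); /* keep the hand-written offset non-negative */
    return pc + (data - dot - 8u);
}
```

With data at 0x8100, an ARM instruction at 0x8000 loads from 0x8100 as intended. The same source line assembled as Thumb at 0x8000 loads from 0x80fc, and at 0x8002 from 0x80fa, so the load misses by 4 or 6 bytes depending on the alignment of the instruction.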

How to resolve it

If loading from a local text section label, don't try to be clever with explicit PC arithmetic, just use LDR <reg>, <label> syntax:

        LDR     r0, data

In ARM, the assembled output will be the same as what the programmer wrote in the first place.

In Thumb, the assembler will work out the correct PC offset for you, which is guaranteed to be stable since code sections are always marked as 4-byte aligned in object files.

Typical uses - getting the address of local data in the text section

Sometimes, code needs to get the address of local data in the text section, to pass to another function, or to perform table-driven jumps etc.

Note: this section applies only to obtaining the address of data. For details on getting the address of code, see elsewhere.

Traditional way

Programmers who are not aware of the ADR pseudo-instruction may write code like this:

        ADD     r1, pc, #(data - . - 8)
...
data:

What happens in ARM

pc == . + 8, so r1 is set to (. + 8) + data - . - 8, = data

What happens in Thumb

pc == (. & ~3) + 4, so r1 is set to ((. & ~3) + 4) + data - . - 8, = data - 4 - (. % 4)

This is probably wrong.

Even more confusingly, the adjustment for a PC-based SUB instruction in Thumb-2 is different from that for an ADD instruction, so the result will be different again if you try to reference a label occurring before the ADD instruction --- the assembler may transform the ADD into a SUB in this case.

How to resolve it

There is a special pseudo-instruction, ADR, for getting the address of a nearby label:

        ADR     r1, data
...
data:

The assembler resolves ADR to a PC-based ADD or SUB, with appropriate adjustments to the offset.

Typical uses - jump tables and local jumps

Traditional way

What happens in ARM

What happens in Thumb

How to resolve it

Typical uses - callbacks

Sometimes, you want to get the address of a function or piece of code to use as a callback, either by passing it to another function or by storing it in a structure where it may be read and used by some different code.

Traditional way

In ARM code you didn't need to care about how you got the address, because all methods work. As a silly example, suppose we're trying to register an assembler cleanup function with atexit(3)

Method 1: literal load

        LDR     r0, =func
        BL      atexit

func:
        <do something> ...

Method 2: PC-relative

        ADR     r0, func        @ Remember, this is a pseudo-op for ADD/SUB r0, pc, #<constant determined by the assembler>
        BL      atexit

func:
        <do something> ...

What happens in ARM

In ARM, both methods "just work".

Method 2 is only allowed for nearby labels in the same section of the same source file, but it can be more compact and efficient.

What happens in Thumb

In Thumb, there are some things to bear in mind:

  • ADR never sets the "Thumb bit". You could add it yourself using ADR r0, func | 1 but then you need #ifdefs or some other special case magic depending on whether the code is built in ARM or Thumb, which is ugly and harder to maintain.

  • For external symbols, LDR <reg>, =<symbol> will give you a value with the Thumb bit set appropriately.

  • For local symbols, LDR <reg>, =<symbol> will not set the Thumb bit unless the symbol is designated as a function symbol.

How to resolve it

        LDR     r0, =func
        BL      atexit

.type func, %function
func:
        <do something>

This ensures that the Thumb bit in the address gets set, if you happen to compile the code in Thumb. If you compiled in ARM, the Thumb bit will be clear, as normal. Provided that the code which uses the value performs a correct interworking function call, it doesn't need to know whether your assembler was ARM or Thumb when calling back into it.

Jump tables

WRITE ME

The SWP instruction

The SWP instruction performs a locked read-write operation on a piece of memory, similar to the x86 xchg instruction. This can have a bad impact on performance in modern systems with a complex hardware architecture and/or multiple processors or other bus masters, so this instruction is deprecated.

Because the Thumb-2 instruction set was introduced after SWP became deprecated, there is no encoding for SWP in Thumb-2 at all. SWP is therefore not allowed when building for Thumb-2, and any use of it will lead to build failures.

See "Atomic Operations" for a more general discussion of how to port these cases.

Atomic operations and SMP safety (SWP, LDREX, STREX; memory barriers)

A number of packages contain homegrown atomic operation implementations.

Atomic operation primitives

Prior to version 6 of the ARM architecture, there were only two ways to do atomic operations:

|| method || notes ||
|| via the kernel || The kernel can ensure that no other user thread interrupts the operation if necessary, but transitioning in and out of the kernel is relatively expensive and slow. ||
|| using the SWP instruction || SWP does an atomic memory read-write operation, analogous to lock xchg on x86. This is also quite expensive on modern platforms, since the whole system bus must be locked for the operation. SWP (and the byte-sized version SWPB) don't scale well to multicore platforms, and are deprecated from ARMv7 onwards. SWP and SWPB are not permitted in Thumb-2 code. SWP can be used to implement a simple mutex, but more complex primitives (such as atomic increment, bit set, etc.) must usually be constructed using a mutex as a base. ||

ARMv6 introduces a new mechanism, known as "exclusives", using the LDREX and STREX instructions. Direct use of these instructions is not recommended for new code (see "GCC atomic primitives" for a better alternative). However, to assist with understanding existing code, a quick overview follows:

        LDREX   r0, [r1]
            /* do something with r0 */
            /* no store operations can occur between a LDREX and its corresponding STREX */
        STREX   r2, r0, [r1]
            /* now r2 = 0 if the new value was stored */
            /* r2 = 1 if the store was abandoned */

Because the STREX is allowed to "fail", it is no longer necessary to lock out other bus activity, or stop other threads from executing (whether on the same CPU or on another CPU in a multiprocessor system); atomic use in a thread can preempt a pending atomic in another thread. For this reason, these primitives are usually also placed in a loop.

For example:

/* atomically increment a word: */

try_to_increment:
        LDREX   r0, [r1]
        ADD     r0, r0, #1
        STREX   r2, r0, [r1]
        CMP     r2, #0
        BNE     try_to_increment

Note: The mechanisms used to make exclusives work are completely different from those used by SWP, so when porting code that accesses a particular instance of an atomic object, you must port all the code accessing that object, not just some of it. This is not usually a problem because a given atomic object will generally be managed by a single piece of library code shared between the threads or processes accessing it.
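
The same retry pattern can be written in C. The sketch below uses GCC's __sync_bool_compare_and_swap builtin as a stand-in for the LDREX/STREX pair; the function name atomic_increment is illustrative, and how the builtin is implemented on a given target is up to the compiler and its runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Atomically increment *word and return the new value. */
uint32_t atomic_increment(volatile uint32_t *word)
{
    uint32_t old_value, new_value;

    assert(word != 0);
    do {
        old_value = *word;           /* plays the role of LDREX */
        new_value = old_value + 1u;  /* the ADD */
        /* Plays the role of STREX + CMP + BNE: the store is abandoned and
           the loop retried if *word changed since it was read. */
    } while (!__sync_bool_compare_and_swap(word, old_value, new_value));

    return new_value;
}
```

For a plain increment, __sync_add_and_fetch would do the whole job in one call; the explicit loop is shown only because it mirrors the structure of the assembler version.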

Spinlocks are relatively straightforward to implement using these primitives; but many other issues need to be taken into account in order to implement a robust multithreaded primitive such as a conventional semaphore or mutex.

Generally it is preferable to use library functionality rather than reinventing sophisticated primitives.

Memory access ordering and memory barriers

ARMv6 and later have a weakly-ordered memory model.

This means that if a processor performs a sequence of memory accesses, then a second processor or other bus master (DMA controller etc.) is not guaranteed to see the memory accesses happen in the same order.

This is a particular concern when performing atomic operations, since atomics usually exist to protect simultaneous access to some data structure or some other resource. Consider the following example which attempts to reset a 128-bit counter:

reset_counter:
        BL      get_mutex

        /* ! */

        /* now we modify the object protected by the mutex: */
        LDR     r1, =object
        MOV     r0, #0
        STR     r0, [r1]
        STR     r0, [r1, #4]
        STR     r0, [r1, #8]
        STR     r0, [r1, #12]

        /* ! */

        BL      release_mutex

Due to caching effects, a second processor might see the mutex get acquired after some of the STR instructions have completed, which can lead to the counter being left in an unexpected state. Note that replacing the STR instructions with a single instruction such as STMIA does not necessarily solve this, since a single instruction can still be split into multiple underlying memory accesses.

To solve this, it is necessary to ensure everything in the system sees the memory accesses which form part of get_mutex occur before the object is modified, and sees all the memory accesses which modify the object before release_mutex is called.

The solution is to place a Data Memory Barrier instruction at the locations marked /* ! */ to ensure the correct access order.

Clearly, the ideal place to do this is in the implementation of get_mutex and release_mutex --- which is why SMP safety must be considered carefully when implementing such operations (even if LDREX and STREX are used to implement them).

The ARM Architecture Reference Manual contains a full discussion, but generally it is more robust and portable to rely on libraries for atomic operation facilities.
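
As a sketch of where the barriers go, here is the counter reset written in C with GCC builtins. __sync_synchronize() is GCC's full memory barrier. The get_mutex and release_mutex shown here are simplified spinning stand-ins written for this example (a real mutex would sleep rather than spin), and counter_words is a hypothetical object.

```c
#include <assert.h>
#include <stdint.h>

static volatile int lock_word;       /* 0 = free, 1 = held */
volatile uint32_t counter_words[4];  /* the 128-bit counter */

void get_mutex(void)
{
    /* Atomic exchange: spin until the previous value was 0. */
    while (__sync_lock_test_and_set(&lock_word, 1))
        ;
    __sync_synchronize();  /* first "!": lock is taken before the stores */
}

void release_mutex(void)
{
    assert(lock_word == 1);  /* releasing a lock that is not held is a bug */
    __sync_synchronize();    /* second "!": stores complete before unlock */
    __sync_lock_release(&lock_word);
}

void reset_counter(void)
{
    int i;

    get_mutex();
    for (i = 0; i < 4; i++)
        counter_words[i] = 0;
    release_mutex();
}
```

The explicit barriers are stronger than strictly necessary, since GCC documents __sync_lock_test_and_set as an acquire barrier and __sync_lock_release as a release barrier; they are spelled out here to match the two marked locations in the assembler example.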

GCC atomic primitives

WRITE ME

In the meantime, search for __sync_synchronize and friends in the GCC info docs.

The formal specification of the atomics can be found at http://refspecs.freestandards.org/elf/IA64-SysV-psABI.pdf
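
As a starting point, a few of the __sync_* builtins can be used like this. The builtin names are GCC's; the wrapper names (refcount_get, refcount_put, set_flag_bits) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Atomically add 1 and return the new count. */
uint32_t refcount_get(volatile uint32_t *count)
{
    return __sync_add_and_fetch(count, 1u);
}

/* Atomically subtract 1; returns nonzero when the count reached zero. */
int refcount_put(volatile uint32_t *count)
{
    assert(*count > 0u);
    return __sync_sub_and_fetch(count, 1u) == 0u;
}

/* Atomically set bits and return the previous value of the word. */
uint32_t set_flag_bits(volatile uint32_t *flags, uint32_t bits)
{
    return __sync_fetch_and_or(flags, bits);
}
```

Because the compiler chooses the instruction sequence, source written this way needs no changes when the package is rebuilt for ARM or for Thumb-2.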

Operand combinations

Thumb-2 is generally a bit more restricted with regard to instruction operands. You may find that you get assembler errors when assembling for Thumb-2.

Use of PC and SP

Note:

  • PC may also be denoted by "r15" in assembler source.
  • SP may also be denoted by "r13" in assembler source.

Generally, doing fancy stuff with the program counter and stack pointer registers is deprecated in ARMv7, and may not be allowed in Thumb-2 at all. Existing code which does such things will generally need some porting (though it is generally safe not to worry if you don't get errors or warnings when building).

If it sounds from the register's name like it may not have been designed for what you're doing, you are probably doing "fancy stuff" and should avoid it... ;)

Details:

Stack Pointer (SP) register

For SP, you can push, pop, ldmfd sp!,..., stmfd sp!,... or add or sub or mov, but you should generally avoid doing anything else, and do not use SP as a destination register in other operations, or attempt to push or pop sp itself from the stack. Older assemblers may not accept push and pop in ARM code; this may be (but probably is not) an issue for Debian compatibility.

Using SP as a base register for load and store operations is allowed, and you may add an offset to the address, but you may not multiply/shift/scale SP.

Upwards-growing stacks (ldmea sp!, ..., stmea sp!, ... etc.) are deprecated, but it is rare to encounter these.

Program Counter (PC) register

For PC, you cannot use it as the destination register in most operations, except for mov, ldr, and pop {..., pc} or ldmfd sp!, {..., pc}.

Using the PC as a source register in simple operations (add <reg>, pc, ..., sub <reg>, pc, ..., or mov <reg>, pc) is allowed but may produce different results in Thumb compared with ARM, and non-Thumb-aware code which does these things will need to be ported. In particular, attempts to manually determine a return address (mov lr, pc or similar) or index inline jump tables (ldr pc, [pc, <index>]) or similar may need attention.

Using the PC as a base address register (i.e., the first operand inside the brackets [pc ...]) is allowed in simple load and store operations, but you may not multiply/shift/scale or auto-update the PC (! or [pc],<index> syntax). Again, the results may be different between ARM and Thumb.

See "PC arithmetic and position-independent addressing" for more detail on this and how to handle these cases.

You should generally not push PC onto the stack or store it via push, str, stmfd etc.; some of these operations may not be allowed in Thumb-2, and results may differ between ARM and Thumb. However, loading PC from the stack is explicitly allowed as one of the "correct" ways of doing a procedure return - see "Procedure calls and returns".

ARM versus Thumb

It's important to understand what kind of assembler you're looking at, and the instruction set it will be assembled for. The primary reason for this is to understand the requirements for interworking (function calls between ARM and Thumb code) to work properly.

There are a few possibilities here:

Traditional ARM assembler

  • out-of-line assembler files (.s, .S)

  • the following directives are not present in the source: .code 16, .thumb, .thumb_func, .syntax unified

  • 3- or 4-operand instructions present (e.g., add r0, r1, #3) and arbitrary conditional instructions (e.g., subgt r1, r4, r5)

  • assembles to fixed-size 32-bit instructions

Unified assembler

  • out-of-line assembler files (.s, .S)

  • the following directive is present in the source: .syntax unified

  • code looks similar to traditional ARM assembler
  • hashes (#) in front of immediate operands are not required and may be absent

  • except for conditional branches, conditional instruction sequences must be preceded immediately by it directives (such as itte eq / moveq r0, r1 / subeq r2, 1 / movne r0, 0)

  • assembles either to fixed-size 32-bit (ARM) instructions, or mixed-size (16-/32-bit) Thumb-2 instructions, depending on the presence of .code, .thumb, .arm directives etc.

"Hybrid" assembler

  • all GCC inline assembler (.c, .h, .cpp, .cxx, .c++ and so on)

  • code intended to build for Thumb-2 (lucid default) or ARM, depending on GCC configuration and command-line switches (-marm, -mthumb)

  • code should be understandable as traditional ARM assembler and unified assembler

  • it blocks may be present (but are inferred otherwise --- may be better to omit them unless it is critical for performance, for compatibility with older Debian tools etc.)

  • hashes (#) in front of immediate integer constants should be present.

  • assembles either to fixed-size 32-bit (ARM) instructions, or mixed-size (16-/32-bit) Thumb-2 instructions (lucid default), depending on the GCC configuration and command-line options (-marm, -mthumb).

Traditional Thumb-1 assembler (rare)

  • You probably won't see any of this!
  • out-of-line assembler files (.s, .S)

  • any of the following directives are present in the source: .code 16, .thumb, .thumb_func

  • the following directive is not present in the source: .syntax unified

  • mostly 2-operand instructions (add or sub can sometimes have 3)

  • all instructions are unconditional except for branches (beq etc.)

ARM/Thumb2PortingHowto (last edited 2011-06-08 11:09:58 by host109-145-98-191)