Revision 40 as of 2011-05-31 09:45:48

When you see some assembler in a source package, there are some things which you need to consider when porting for Thumb-2 compatibility. This page aims to highlight the main issues and help would-be porters to get started.

How to Port Packages

[WRITE ME] [find the affected bits of code] [work out the implications and fix, depending on the issue types]

IRC discussion highlighting some of these issues:

The specification for atomic primitives implemented by GCC can be found here:

General ARM documents are available at

For the instruction set "quick reference cards", browse to:

  • Browse to ARM Architecture -> Instruction Set Quick Reference Cards

  • For ARM procedure call standard, browse to RealView software development tools -> Application Binary Interface

See also for postings of interest to software developers.

ARM Assembler Overview

I could spend a long time writing one, but there are plenty of resources on the web. Most are a bit out of date, but this one isn't too bad:

Warning: the above page is a good introduction, but some things are no longer true for newer ARM architecture versions (such as ARMv7). The information on this wiki page takes precedence.

Key Thumb-2 compatibility Issues

Here's a quick breakdown of the key issues which may need attention.

Identifying the Target Architecture

Because you're trying to port ARM assembler, it is assumed here that the affected source files or build system have already detected that you are building for an ARM processor of some sort.

How to tell which ARM architecture variant is being targeted takes a bit more work. Generally, GCC defines one macro depending on the targeted variant:



  • ARMv2 or ARMv2a
  • ARMv4 (including StrongARM). Distro support: Debian runtime baseline for the armel arch.
  • ARMv4T (including ARM7TDMI). Thumb variant supported: Thumb 1. Distro support: Debian build baseline for the armel arch (Thumb not used).
  • ARMv5 non-Thumb variants
  • ARMv5 Thumb variants (including Xscale). Thumb variant supported: Thumb 1. Distro support: Ubuntu jaunty build baseline is ARMv5TE.
  • ARMv6 and variants. Thumb variant supported: Thumb 1. Distro support: Ubuntu karmic build baseline is ARMv6 + VFPv2.
  • ARMv6T2. Thumb variant supported: Thumb 2. Distro support: not officially.
  • ARMv7 variants common subset. Thumb variant supported: Thumb 2. Distro support: not applicable; ARM instruction set not guaranteed to be supported.
  • ARMv7-A (Applications profile, including Cortex-A8, Cortex-A9). Thumb variant supported: Thumb 2. Distro support: Ubuntu lucid build baseline is ARMv7-A + VFPv3-D16 + Thumb-2.
  • ARMv7-R (Real-Time profile). Thumb variant supported: Thumb 2. Distro support: not officially; embedded only; not applicable.
  • ARMv7-M (Microcontroller profile). Thumb variant supported: Thumb 2. Distro support: not applicable; ARM instruction set not supported.

In addition, the following macros are used to indicate the architecture family and target instruction set:




Always defined when building for the ARM architecture (regardless of target instruction set)


Defined only when targeting any Thumb instruction set variant (Thumb-1 or Thumb-2)


Defined only when targeting Thumb-2

The changes required for Thumb-2 compatibility would provoke build failures when building for older architectures.

Unfortunately, the __ARM_ARCH_* macros are mutually exclusive, which makes it difficult to test whether the architectural features required by a particular code snippet are supported, especially in a forwards-compatible manner.

However, when modifying the Ubuntu archive and upstream, we only really need to worry about build-time compatibility with Ubuntu releases and Debian (since Debian uses the oldest build baselines among all active distros for ARM, AFAIK).

This suggests the following most general #ifdef test to use when making Thumb-2 porting changes:

#if defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_4T__)
        /* traditional ARM-compatible code */
#else
        /* Thumb-2 compatible code, which should only use features compatible with both Thumb-2 and ARMv5 */
#endif

If the above causes problems for the Debian community, it may be preferable to use this more comprehensive test:

#if defined(__ARM_ARCH_3__) || defined(__ARM_ARCH_3M__) || defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_4T__)
        /* traditional ARM-compatible code */
#else
        /* Thumb-2 compatible code, which should only use features compatible with both Thumb-2 and ARMv5 */
#endif

...however, this starts to become fragile and cumbersome, and may be overkill; it depends on the level of activity of the old Debian arm port.
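One way to keep such tests manageable is to collapse the mutually exclusive macros into a single number once, in a shared header, and test that number everywhere else. The following is only a sketch: ARM_ARCH_LEVEL, arm_arch_level() and use_traditional_arm_code() are names invented here, and the __ARM_ARCH_*__, __arm__ and __ARM_ARCH names are GCC predefined macros as I understand them (__ARM_ARCH is only provided by newer compilers), so check them against your toolchain.

```c
/*
 * Sketch: derive one numeric level from GCC's mutually exclusive
 * __ARM_ARCH_*__ macros. ARM_ARCH_LEVEL is a made-up name, not a GCC macro.
 */
#if !defined(__arm__)
#  define ARM_ARCH_LEVEL 0              /* not building for (32-bit) ARM */
#elif defined(__ARM_ARCH)
#  define ARM_ARCH_LEVEL __ARM_ARCH     /* newer compilers give the number directly */
#elif defined(__ARM_ARCH_7__) || defined(__ARM_ARCH_7A__) || \
      defined(__ARM_ARCH_7R__) || defined(__ARM_ARCH_7M__)
#  define ARM_ARCH_LEVEL 7
#elif defined(__ARM_ARCH_6__) || defined(__ARM_ARCH_6J__) || \
      defined(__ARM_ARCH_6K__) || defined(__ARM_ARCH_6Z__) || \
      defined(__ARM_ARCH_6ZK__) || defined(__ARM_ARCH_6T2__)
#  define ARM_ARCH_LEVEL 6
#elif defined(__ARM_ARCH_5__) || defined(__ARM_ARCH_5E__) || \
      defined(__ARM_ARCH_5T__) || defined(__ARM_ARCH_5TE__) || \
      defined(__ARM_ARCH_5TEJ__)
#  define ARM_ARCH_LEVEL 5
#elif defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_4T__)
#  define ARM_ARCH_LEVEL 4
#else
#  define ARM_ARCH_LEVEL 3              /* assumption: treat anything older as "3 or below" */
#endif

/* Numeric architecture level, or 0 when not building for ARM. */
int arm_arch_level(void)
{
    return ARM_ARCH_LEVEL;
}

/* Non-zero when the "traditional ARM-compatible code" branch should be used. */
int use_traditional_arm_code(void)
{
    return ARM_ARCH_LEVEL >= 1 && ARM_ARCH_LEVEL <= 4;
}
```

With this in place, the porting test becomes #if ARM_ARCH_LEVEL <= 4 in code which is already known to be ARM-only, and it does not need editing when a new architecture variant appears.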

Procedure calls and returns

When using Thumb-2, the system will generally contain a mixture of ARM and Thumb-2 functions (depending on how libraries and binaries, and their component objects and functions, were assembled).

The processor does not automatically know which instruction set is used for the code being executed after a branch, procedure call or procedure return --- instead, it must be told which instruction set to use at the time of the branch or return.

Getting this right is known as "interworking".

For C code, it's magic and will "just work", but for assembler, when you need to jump around, you need to do it the right way and not the wrong way... otherwise the processor will try to interpret the code using the wrong instruction set and sooner or later crash the running process (it certainly won't be doing what the programmer intended... much as if you branched into some arbitrary data).

Quick Reference

The target instruction set state is determined in different ways depending on the type of branch.

Here's a very quick summary of kinds of branch, classified by context. This is not a list of all possibilities, just the most common/popular ones.


  • function call (to label / named function)
    • Traditional ARM instruction: BL <label>
    • Preferred instruction for ARMv7: BL <label>
    • The return address is saved in the Link Register (called lr or r14), and the processor then jumps to the specified label or function.

  • function return
    • Traditional ARM instruction: MOV pc, lr
    • Preferred instruction for ARMv7: BX lr
    • In a system (like Ubuntu lucid) which contains a mixture of Thumb and ARM code, MOV pc, lr is usually not safe - BX lr should be used instead. However, this is not compatible with ARMv4 or earlier (used by Debian), so macros/#ifdefs may be needed.

  • function return via the stack
    • Traditional ARM instruction: LDMFD sp!, {<registers>, pc}
    • Preferred instruction for ARMv7: LDMFD sp!, {<registers>, pc}
    • In order for this to work, the Link Register (lr) must have been pushed to the stack on entering the function - the LDMFD then restores the value directly into pc. For code which does not need to be compatible with old versions of the assembler, you can also use the newer form POP {<registers>, pc}.

  • non-returning branch / jump to label
    • Traditional ARM instruction: B <label>
    • Preferred instruction for ARMv7: B <label>
    • no modification needed for ARMv7

  • non-returning jump to address in register
    • Traditional ARM instruction: MOV pc, <register>
    • Preferred instruction for ARMv7: BX <register>
    • In a system (like Ubuntu lucid) which contains a mixture of Thumb and ARM code, use BX instead of MOV. #ifdefs may be required for compatibility with e.g., Debian

  • function-call to address in register
    • Traditional ARM instruction: MOV lr, pc ; MOV pc, <register>
    • Preferred instruction for ARMv7: BLX <register>
    • In a system (like Ubuntu lucid) which contains a mixture of Thumb and ARM code, use BLX instead of the traditional form. #ifdefs may be required for compatibility with e.g., Debian

  • non-returning jump to address in memory
    • Traditional ARM instruction: LDR pc, [<register>]
    • Preferred instruction for ARMv7: LDR pc, [<register>]
    • no modification needed for ARMv7

  • function call to address in memory
    • Traditional ARM instruction: MOV lr, pc ; LDR pc, [<register1>]
    • Preferred instruction for ARMv7: LDR <register2>, [<register1>] ; BLX <register2>
    • The "traditional" version is ARMv7-compatible, but may not be recognised as a function call by the branch predictor, which can lead to wasted cycles due to extra branch mispredictions. #ifdefs may be required for compatibility with e.g., Debian

Detailed Instruction Behaviour

Here is more detail on the behaviour of the various instructions which can be used to perform jumps, calls and returns.

Note that the optional condition code at the end of each instruction mnemonic is omitted.

  • B <label>: no switch

    • This is (usually) the right way to do a non-returning jump to a label or function.
    • If <label> is an external symbol defined elsewhere, the linker will magically patch it up to switch appropriately.

    • Caution is required if ARM and Thumb are mixed in a single source file (rare), since there is no automatic instruction set switch for local symbols (in this unlikely case you may need to use BX instead). The assembler may or may not magically introduce a veneer/trampoline depending on whether or not it knows that the destination is in a different instruction set and is definitely a code symbol (à la .type <symbol>, %function or .thumb_func). Note that just because a symbol appears in a code section it is not assumed to be a code symbol unless specifically tagged in one of the aforementioned ways.

  • BL <label>: no switch

    • This is (usually) the right way to do a procedure call to a label or function.
    • If <label> is an external symbol defined elsewhere, the linker will magically patch it up to switch appropriately --- but this does not guarantee a successful return unless the called function does a correct interworking return.

    • The link register (LR or r14) is automatically set to the return address, just following the call. The bottom bit of LR is automatically set to 0 (ARM) or 1 (Thumb) to indicate which instruction set to switch to when returning. (This is not automatically done if you use an instruction like MOV lr, pc to determine the return address --- this can lead to problems when the called function returns.)

    • Caution is required if ARM and Thumb are mixed in a single source file (rare). As for B <label>, the assembler may or may not magically introduce a veneer/trampoline depending on whether or not it knows that the destination is in a different instruction set and is definitely a code symbol (à la .type <symbol>, %function or .thumb_func).

  • BX <register> (register is usually lr): switches depending on bottom bit of <register>

    • This is one of the two right ways to do a procedure return.
    • Assumes the LR value (or whatever <register> is used) has the bottom bit set correctly to indicate ARM or Thumb (which will be the case if it comes from the LR value set by a correct procedure call)

    • It is not supported on ARMv4, so for backwards-compatibility a workaround is needed. Note that BX lr is better for performance on newer processors, since MOV pc, lr may harm branch prediction performance.

#if defined (__ARM_ARCH_4__)
        "mov    pc, lr"
#else
        "bx     lr"
#endif
  • BLX <register> (register should usually not be lr): saves the return address in lr and calls a function at the address held in the specified <register>. Switches instruction set depending on bottom bit of <register>.

    • This is the preferred way to do a procedure call to a computed or variable address.
    • The called function must still do a correct return.
    • This instruction is not supported on ARMv4(T), so code intended to build and work on Debian must use an alternative workaround. Because Debian does not use Thumb code, the following snippet is usually sensible (see "Computed destinations and returns" for an explanation of why this isn't safe for Thumb code, though):

#if defined (__ARM_ARCH_4T__) || defined (__ARM_ARCH_4__)
        "mov    lr, pc\n\t"
        "mov    pc, <register>"
#else
        "blx    <register>"
#endif
  • LDR pc, [...] or POP {..., pc} or LDMFD sp!, {..., pc}: switches depending on the bottom bit of the value loaded for PC.

    • The other right way to do a procedure return.
    • Assumes you saved the LR value as part of the function prologue.
    • Assumes the LR value has the bottom bit set correctly to indicate ARM or Thumb (which will be the case if the procedure call was done in the right way)
    • Debian-compatible
    • Additional registers can be restored from the stack as part of the return, in the usual way.
  • MOV pc, <register>: no switch unless executed from ARM code AND the processor is >= ARMv7

    • Often the right way to implement inline jump tables (see "PC arithmetic and position-independent addressing")
    • Debian-compatible way to do a procedure return or computed/variable branch (ARM only)
    • Not recommended for procedure returns on ARMv7.
    • Generally, it's best to avoid relying on the interworking behaviour of this instruction, since newer ARM processors are not all optimised to do efficient branch-prediction in this case, so something like this is preferable:

#if defined (__ARM_ARCH_4T__) || defined (__ARM_ARCH_4__)
        "mov    pc, lr"
#else
        "bx     lr"
#endif
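Rather than repeating these #if blocks at every call and return site, the choice can be made once and wrapped in string macros which are then pasted into inline assembler. This is a sketch only: ARM_RETURN_INSN, ARM_CALL_REG_INSNS and the two helper functions are names invented for illustration, and it reuses the ARMv4/ARMv4T test suggested above.

```c
/*
 * Sketch: pick the return and call-via-register sequences once.
 * All macro and function names here are hypothetical.
 */
#if defined(__ARM_ARCH_4__) || defined(__ARM_ARCH_4T__)
   /* Debian-style baseline: no BLX, so fall back to the traditional forms. */
#  define ARM_RETURN_INSN         "mov\tpc, lr"
#  define ARM_CALL_REG_INSNS(reg) "mov\tlr, pc\n\t" "mov\tpc, " reg
#else
   /* Interworking-safe forms for ARMv5 and later, including Thumb-2. */
#  define ARM_RETURN_INSN         "bx\tlr"
#  define ARM_CALL_REG_INSNS(reg) "blx\t" reg
#endif

/* The selected return instruction, as it would appear in an asm string. */
const char *arm_return_insn(void)
{
    return ARM_RETURN_INSN;
}

/* The selected sequence for calling the function whose address is in r3. */
const char *arm_call_r3_insns(void)
{
    return ARM_CALL_REG_INSNS("r3");
}
```

Inside an inline asm statement the macros would be used as ordinary string fragments, for example ARM_CALL_REG_INSNS("%0") with the function address passed as a register operand. Remember that the fallback sequence is only safe in ARM code, as explained under "Computed destinations and returns".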

Computed destinations and returns

Whenever a destination or return address is variable or otherwise determined at run-time, you need to be careful to set the "thumb bit" (bit 0) in the address correctly and/or do the correct type of branch, to make sure that the call (and return, if applicable) switch instruction set appropriately.

labels or functions:

  • If you reference an external label or function defined in another object, the linker will magically give you an address with the "Thumb bit" (bit 0) set appropriately. This means you can branch to it safely with bx or blx, or store it in memory and load it into PC later, pass it to other functions as a callback, etc.

  • If you reference a symbol internal to the object, life is more "interesting":
    • If the symbol is a C function the Thumb bit in the address you get will be set appropriately.
    • If the symbol is tagged using an assembler directive .type <symbol>, %function or .thumb_func, the Thumb bit in the address you get will be set appropriately. (GCC always does this for C functions.)

    • Otherwise, the Thumb bit will not be set appropriately.

    • When referencing GNU assembler local labels (0b, 1f etc.) the Thumb bit will not be set appropriately.


  • The current assembly location symbol in the GNU assembler (.) never has the Thumb bit set.

  • This behaviour is usually useful but sometimes unexpected. As a consequence, code like ldr r0, =. ; bx r0 is not an infinite loop in Thumb --- instead, it would re-execute the instructions as ARM and will probably crash the process.

  • Note that because b and bl do not switch instruction state, subs r0, r0, #1 ; bne . - 2 will work as a simple delay loop in Thumb (but you should never write it this way; see "PC and . arithmetic and position-independent addressing").

In cases where the Thumb bit is not set appropriately, it will simply be left as 0. For this reason, the distinction is not important when executing in ARM (where no instruction set change is implied by the Thumb bit), but is important in Thumb (where there may be an unintentional switch to ARM if you don't take corrective action).

The general rule is as follows: if the address will be passed to any other function or object (as a return address, method address, callback etc.) then you must ensure that the Thumb bit gets set if the code is assembled as Thumb.

However, if the little bit of code you're hacking knows that the Thumb bit is never set in the address, it may be safe not to set it so long as you bear this in mind. This can make sense for inline jump tables etc. -- see "Jump tables"
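The role of the Thumb bit can be modelled in a few lines of C. This is a toy model for reasoning about addresses, not code to put in a package; the type and function names are made up, and it only captures the rule described above: BX and BLX <register> pick the instruction set from bit 0, and the instruction actually fetched is at the address with bit 0 cleared.

```c
#include <stdint.h>

enum isa_state { STATE_ARM = 0, STATE_THUMB = 1 };

/* Instruction set selected by BX/BLX <register> for this address value. */
enum isa_state state_after_bx(uintptr_t target)
{
    return (target & 1u) ? STATE_THUMB : STATE_ARM;
}

/* Address of the first instruction executed after the branch. */
uintptr_t instruction_address(uintptr_t target)
{
    return target & ~(uintptr_t)1;
}

/* What a Thumb routine must hand out when publishing its own address. */
uintptr_t as_thumb_target(uintptr_t code_address)
{
    return code_address | 1u;
}
```

For example, the value of "." for a Thumb instruction at 0x8002 is 0x8002 with bit 0 clear, so state_after_bx(0x8002) is STATE_ARM: this is the ldr r0, =. ; bx r0 surprise described above. Passing as_thumb_target(0x8002), i.e. 0x8003, keeps the processor in Thumb and still lands on the instruction at 0x8002.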

PC and "." arithmetic and position-independent addressing


ARM instructions cannot pack an arbitrary address into any instruction as an operand. For this reason, and if position-independence is required, a common coding technique is to address some data using offsets relative to the current instruction location.

On ARM, you can (sort of) read PC as if it were an ordinary register, and this can be used to determine the currently executing address, though some adjustment is needed. This allows various position-independent coding tricks; however, the value you observe when "reading" the PC is not the same in ARM and Thumb, and is not the same in all circumstances:

  • Instruction set state: ARM
    • most instructions: observed value of PC is address of instruction + 8
    • instructions which store PC to memory, e.g. STR pc, [sp, #4]: address of instruction + 8 or 12 (implementation-dependent, but consistent for a given device; avoid relying on this behaviour)
    • source operand in ADD, SUB or MOV, e.g. ADD r1, pc, #5: address of instruction + 8
  • Instruction set state: Thumb
    • source operand in ADD, or base register in load or store, e.g. LDR r2, [pc, #16]: (address of instruction & ~3) + 4
    • source operand in SUB or MOV, e.g. SUB r2, pc, #8: address of instruction + 4
    • most other uses

Here, "address of instruction" is equivalent to the value of the magic assembler location counter symbol "." if it appears in the same instruction, i.e. the virtual address of the first byte of the currently executing instruction (which in ARM is always a multiple of 4, and in Thumb is always a multiple of 2).

From the above, you can probably guess that it is hard to write code which will work in both ARM and Thumb if you reference the PC directly. Unfortunately, it is quite common to see such explicit references in hand-written traditional ARM assembler.
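To make the differences concrete, here is a small C model of the rules listed above. It is a sketch for experimenting with the numbers, with invented names; it deliberately leaves out the implementation-dependent "store PC to memory" case and the Thumb uses for which no value is given.

```c
#include <stdint.h>

enum isa { ISA_ARM, ISA_THUMB };

/* How the instruction uses PC (only matters in Thumb). */
enum pc_use {
    PC_USE_ALIGNED,   /* source operand in ADD, or base register in load/store */
    PC_USE_PLAIN      /* source operand in SUB or MOV */
};

/* Value observed when the instruction at insn_addr reads PC. */
uint32_t observed_pc(enum isa isa, enum pc_use use, uint32_t insn_addr)
{
    if (isa == ISA_ARM)
        return insn_addr + 8;                      /* same for all modelled uses */
    if (use == PC_USE_ALIGNED)
        return (insn_addr & ~(uint32_t)3) + 4;     /* rounded down to a word first */
    return insn_addr + 4;
}
```

Note how a Thumb instruction at 0x8002 observes 0x8004 when PC is a load base but 0x8006 when PC is a MOV source, whereas the ARM value depends only on the instruction address.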

Some examples follow, along with explanation of how to make them Thumb-2 compatible.

Unless specified otherwise, the suggested workarounds are compatible with all assembler versions, even for assemblers which may not support Thumb-2.

Typical uses - loading a literal from the text section

Traditional way

        LDR     r0, [pc, #(data - . - 8)]       /* bad */
        LDR     r0, [pc, #<magic number>]       /* worse! */
data:   .long 0xdeadbeef

What happens in ARM

pc == . + 8, so r0 is loaded from the address (. + 8) + data - . - 8, = data

Even if you only expect to target ARM code, you should still port any occurrences of the /* worse */ hard-coded offset form of this idiom, since these references will tend to break under routine maintenance. It's always better to let the assembler track offsets and fix things up for you when you edit the code --- this is one of the main reasons why assemblers were invented in the first place.

What happens in Thumb

pc == (. & ~3) + 4, so r0 is loaded from the address ((. & ~3) + 4) + data - . - 8, = data - 4 - (. % 4)

...which is unlikely to be what the programmer intended...
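The arithmetic is easy to check with a few lines of C. This sketch simply evaluates the address used by the traditional idiom LDR r0, [pc, #(data - . - 8)] in each instruction set; the function names are invented, and 32-bit wrap-around arithmetic stands in for the assembler's offset calculation.

```c
#include <stdint.h>

/* The constant the assembler computes for #(data - . - 8). */
static uint32_t traditional_offset(uint32_t insn_addr, uint32_t data_addr)
{
    return data_addr - insn_addr - 8;
}

/* Address actually loaded from when the idiom is assembled as ARM. */
uint32_t arm_load_address(uint32_t insn_addr, uint32_t data_addr)
{
    uint32_t pc = insn_addr + 8;
    return pc + traditional_offset(insn_addr, data_addr);
}

/* Address actually loaded from when the same source is assembled as Thumb. */
uint32_t thumb_load_address(uint32_t insn_addr, uint32_t data_addr)
{
    uint32_t pc = (insn_addr & ~(uint32_t)3) + 4;
    return pc + traditional_offset(insn_addr, data_addr);
}
```

With the LDR at 0x8002 and data at 0x8010, the ARM formula gives 0x8010 as intended, but the Thumb formula gives 0x800A, which is data - 4 - 2. With the LDR at 0x8000 the Thumb result is 0x800C, i.e. data - 4.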

If the programmer coded the offset as an explicit magic constant offset in the LDR instruction (the /* worse */ form above), shrinkage in code size will change the actual offset from the LDR instruction to the literal data word, with the result that instruction data or some other garbage is loaded into the target register instead of the intended value. You may even get a SIGSEGV if the address implied by the hard-coded offset lies past the end of the text segment.

How to resolve it

The common situation is that the code is trying to load a value which is too large to fit as an immediate operand in a MOV instruction, or which may need link-time relocation (the address of an object, for example). For ARM, the assembler has a special shorthand syntax for this:

        LDR     r0, =<value>

This automatically puts <value> somewhere in the text section (you don't usually need to care about exactly where) and assembles the correct PC-relative LDR instruction to load it into the target register. The code above will expand to LDR r0, [pc, #<offset>] followed somewhere later by .long <value>.

If you still want to load from a local text section label which you declare explicitly, don't try to be clever with explicit PC arithmetic, just use LDR <reg>, <label> syntax:

        LDR     r0, data
data:   .long   <value>

GNU as can be inconsistently strict about requiring the loaded value to be assembly-time constant when using the LDR <reg>, = syntax, so you may sometimes need the second, more explicit form if <value> is a more complex link-time relocatable expression.

In both cases, when generating ARM code the assembled output will be the same as what the programmer wrote in the first place.

In Thumb, the assembler will work out the correct PC offset for you, which is guaranteed to be stable since code sections are always marked as 4-byte aligned in object files.

Possible problem - location of the literal pool

Sometimes you will see assembler errors like this when using LDR <Rd>, =<value>:

/tmp/ccNzAyWv.s: Assembler messages:
/tmp/ccNzAyWv.s:29: Error: offset out of range

For C code, the compiler automatically ensures that any literal words are close enough to the corresponding pc-relative LDR instruction for the offset not to overflow. For assembler (including inline), this cannot be done automatically. The assembler will insert the literals in the next compiler-generated literal pool, or at the end of the file, whichever occurs first. ("literal pool" = each place in the text section where literal words are inserted)

Sometimes there is too much intervening code, and the required pc-relative address offset is too large to encode in a single LDR instruction, causing the assembler error shown above.

To solve this, you can tell the assembler to put the literal pool in a particular location using the .ltorg directive. For example:

_init:  LDR     sp, =initialStackTop
        B       start

        .ltorg                  @ insert literal data here

It's up to the programmer to ensure that the literal words are not executed as code. Putting the .ltorg directive immediately after an unconditional branch or unconditional procedure return sequence is usually a good policy. If you really can't find a good location, you may need to put .ltorg in the middle of your code somewhere and use an unconditional branch to jump over it.

The above code will produce assembler output equivalent to the following:

_init:  LDR     sp, literal
        B       start

literal: .long  initialStackTop

(except that the assembler won't really define a label to reference the literal.)

Typical uses - getting the address of local data in the text section

Sometimes, code needs to get the address of local data in the text section, to pass to another function, or to perform table-driven jumps etc.

Note: this section applies only to obtaining the address of data. For details on getting the address of code, see elsewhere.

Traditional way

Programmers who are not aware of the ADR pseudo-instruction may write code like this:

        ADD     r1, pc, #(data - . - 8)

What happens in ARM

pc == . + 8, so r1 is set to (. + 8) + data - . - 8, = data

What happens in Thumb

pc == (. & ~3) + 4, so r1 is set to ((. & ~3) + 4) + data - . - 8, = data - 4 - (. % 4)

This is probably wrong.

Even more confusingly, the adjustment for a PC-based SUB instruction in Thumb-2 is different from that for an ADD instruction, so the result will be different again if you try to reference a label occurring before the ADD instruction --- the assembler may transform the ADD into a SUB in this case.

How to resolve it

There is a special pseudo-instruction, ADR, for getting the address of a nearby label in the same section:

        ADR     r1, data

The assembler resolves ADR to a PC-based ADD or SUB, with appropriate adjustments to the offset.

Typical uses - jump tables and local jumps

Traditional way

What happens in ARM

What happens in Thumb

How to resolve it

Typical uses - callbacks

Sometimes, you want to get the address of a function or piece of code to use as a callback, either by passing it to another function or by storing it in a structure where it may be read and used by some different code.

Traditional way

In ARM code you didn't need to care about how you got the address, because all methods work. As a silly example, suppose we're trying to register an assembler cleanup function with atexit(3)

Method 1: literal load

        LDR     r0, =func
        BL      atexit

func:   <do something> ...

Method 2: PC-relative

        ADR     r0, func        @ Remember, this is a pseudo-op for ADD/SUB r0, pc, #<constant determined by the assembler>
        BL      atexit

func:   <do something> ...

What happens in ARM

In ARM, both methods "just work".

Method 2 is only allowed for nearby labels in the same section of the same source file, but it can be more compact and efficient.

What happens in Thumb

In Thumb, there are some things to bear in mind:

  • ADR never sets the "Thumb bit". You could add it yourself using ADR r0, func | 1 but then you need #ifdefs or some other special case magic depending on whether the code is built in ARM or Thumb, which is ugly and harder to maintain.

  • For external symbols, LDR <reg>, =<symbol> will give you a value with the Thumb bit set appropriately.

  • For local symbols, LDR <reg>, =<symbol> will not set the Thumb bit unless the symbol is designated as a function symbol.

How to resolve it

        LDR     r0, =func
        BL      atexit

        .type func, %function
func:   <do something>

This ensures that the Thumb bit in the address gets set, if you happen to compile the code in Thumb. If you compiled in ARM, the Thumb bit will be clear, as normal. Provided that the code which uses the value performs a correct interworking function call, it doesn't need to know whether your assembler was ARM or Thumb when calling back into it.

Jump tables


The SWP instruction

The SWP instruction performs a locked read-write operation on a piece of memory, similar to the x86 xchg instruction. This can have a bad impact on performance in modern systems with a complex hardware architecture and/or multiple processors or other bus masters, so this instruction is deprecated.

Because the Thumb-2 instruction set was introduced after SWP became deprecated, there is no encoding for SWP in Thumb-2 at all. SWP is therefore not allowed when building for Thumb-2, and any use of it will lead to build failures.

See "Atomic Operations" for a more general discussion of how to port these cases.

Atomic operations and SMP safety (SWP, LDREX, STREX; memory barriers)

A number of packages contain homegrown atomic operation implementations.

Atomic operation primitives

Prior to version 6 of the ARM architecture, there were only two ways to do atomic operations:



via the kernel

The kernel can ensure that no other user thread interrupts the operation if necessary, but transitioning in and out of the kernel is relatively expensive and slow.

using the SWP instruction

SWP does an atomic memory read-write operation, analogous to lock xchg on x86. This is also quite expensive on modern platforms, since the whole system bus must be locked for the operation. SWP (and the byte-sized version SWPB) don't scale well to multicore platforms, and are deprecated from ARMv7 onwards. SWP and SWPB are not permitted in Thumb-2 code. SWP can be used to implement a simple mutex, but more complex primitives (such as atomic increment, bit set, etc.) must usually be constructed using a mutex as a base.

ARMv6 introduces a new mechanism, known as "exclusives", using the LDREX and STREX instructions. Direct use of these instructions is not recommended for new code (see "GCC atomic primitives" for a better alternative). However, to assist with understanding existing code, a quick overview follows:

        LDREX   r0, [r1]
            /* do something with r0 */
            /* no other load or store operations can occur between a LDREX and its corresponding STREX */
        STREX   r2, r0, [r1]
            /* now r2 = 0 if the new value was stored */
            /* r2 = 1 if the store was abandoned */

Because the STREX is allowed to "fail", it is no longer necessary to lock out other bus activity, or stop other threads from executing (whether on the same CPU or on another CPU in a multiprocessor system); atomic use in a thread can preempt a pending atomic in another thread. For this reason, these primitives are usually also placed in a loop. Note that extra load and store instructions (of any kind) between LDREX and STREX can cause the STREX always to fail in some implementations, which is why you shouldn't access memory between the LDREX and the corresponding STREX, except in cases where it's sensible to abandon the operation such as context switches or exceptions.

For example:

/* atomically increment a word: */

try_to_increment:
        LDREX   r0, [r1]
        ADD     r0, r0, #1
        STREX   r2, r0, [r1]
        CMP     r2, #0
        BNE     try_to_increment
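The same read, modify, try-to-store, retry structure can be written portably in C with a GCC compare-and-swap intrinsic. This is a sketch rather than a drop-in replacement: atomic_increment is a made-up name, and a compare-and-swap is not exactly equivalent to an LDREX/STREX pair (it only checks that the value is unchanged, not that the location was untouched), which does not matter for a simple counter.

```c
/* Atomically increment *p and return the new value. */
unsigned atomic_increment(volatile unsigned *p)
{
    unsigned old_value, new_value;

    do {
        old_value = *p;              /* like LDREX: read the current value */
        new_value = old_value + 1;   /* like ADD */
        /* like STREX, CMP, BNE: store only if *p is unchanged, else retry */
    } while (!__sync_bool_compare_and_swap(p, old_value, new_value));

    return new_value;
}
```

On ARMv6 and later GCC is expected to expand the intrinsic into an LDREX/STREX loop of its own, so nothing is lost by avoiding the hand-written assembler.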

Note: The mechanisms used to make exclusives work are completely different from those used by SWP, so when porting code that accesses a particular instance of an atomic object, you must port all the code accessing that object, not just some of it. This is not usually a problem because a given atomic object will generally be managed by a single piece of library code shared between the threads or processes accessing it.

Spinlocks are relatively straightforward to implement using these primitives; but many other issues need to be taken into account in order to implement a robust multithreaded primitive such as a conventional semaphore or mutex.

Generally it is preferable to use library functionality rather than reinventing sophisticated primitives.

Memory access ordering and memory barriers

ARMv6 and later have a weakly-ordered memory model. C and many other high-level languages also have a data model which allows memory access to be done in a different order from that specified in the source code.

This means that when a program performs a sequence of memory accesses, the program (or another thread or other bus master; DMA controller etc.) is not guaranteed to see the memory accesses happen in the same order; indeed some accesses may not be observed at all.

This is a particular concern when performing atomic operations, since atomics usually exist to protect simultaneous access to some data structure or some other resource.

For example, when taking a lock it is necessary that everything in the system knows the lock has been taken, before anyone sees modifications happen on the object which the lock is intended to protect.

The following example demonstrates how a large, 128-bit counter variable might be reset. Note that we assume here that there is no way to force the processor to change the whole 128-bit value in a way which is fully atomic:

        BL      get_mutex

        /* ! */

        /* now we modify the object protected by the mutex: */
        LDR     r1, =object
        MOV     r0, #0
        STR     r0, [r1]
        STR     r0, [r1, #4]
        STR     r0, [r1, #8]
        STR     r0, [r1, #12]

        /* ! */

        BL      release_mutex

Due to caching effects, and reordering of operations in the processor pipeline, a second processor might see the mutex get acquired after some of the STR instructions have completed, which can lead to the counter being observed in an unexpected state. Indeed, the second processor might even see the STRs occur out of order, though that does not matter in this particular case. Replacing the STR instructions with a single instruction such as STMIA does not necessarily solve either problem, since a single instruction can still be split into multiple underlying memory accesses.

To solve this, it is necessary to ensure everything in the system sees the memory accesses which form part of get_mutex occur before the object is modified, and sees all the memory accesses which modify the object before release_mutex is called.

Clearly, the ideal place to do this is in the implementation of get_mutex and release_mutex --- which is why SMP safety must be considered carefully when implementing or porting such operations (even if LDREX and STREX are used to implement them).

The solution is to place a Data Memory Barrier instruction at the locations marked /* ! */, or inside get_mutex and release_mutex to ensure the correct access order is observed by everything in the system.

On ARMv7, there is a DMB instruction which performs this operation. There is also an MCR p15, ... operation which is backwards-compatible with the earlier architecture versions.

The ARM Architecture Reference Manual contains a full discussion, but generally it is more robust and portable to rely on libraries for atomic operation facilities, rather than delving into assembler.

GCC atomic primitives

Using assembler generally impairs portability. As well as causing compatibility problems between different architectures, it causes a problem for backwards and forwards portability within a single architecture family, where the details of atomic memory access may evolve over time.

Since at least GCC 4.4.3 (most architectures) and linux-2.6.19 (additional requirement for ARM), some compiler intrinsics are available which abstract these operations.

Refer to the GCC info documentation for details (search for __sync_synchronize).

The formal specification of the atomics can be found at


Avoid specifying extra optional arguments for the atomics to specify which objects the accompanying memory barrier(s) should be applied to. GCC's behaviour may change in the future, and there is some uncertainty about what exact behaviour the spec requires in this case. Omitting the extra arguments causes the barrier to apply to all objects --- this may be wasteful in some cases, but should be safe.

The functions __sync_lock_test_and_set() and __sync_lock_release() are specifically for implementing a non-recursive mutex. It is best not to attempt to use them for other purposes, because they have different memory barrier behaviour from the other atomic intrinsics. Similarly, it's best not to hand-craft a non-recursive mutex using other operations, since the result may be less efficient.

Using __sync_synchronize() by itself (not accompanied by other atomic operations) is sometimes sensible, but often inappropriate. If you find yourself wanting to sprinkle __sync_synchronize calls around your code, it's recommended that you stop and think carefully before proceeding.

Locking functions should be declared as inline and included via headers if they are used to protect objects which are not considered globally visible. This is required for the compiler to understand the data barrier implications at the source code language level. An object is only guaranteed to be considered globally visible if it is declared volatile, if it has external linkage (extern, or non-static declaration at module level), or if a pointer to the object is passed to a function outside the compilation unit (usually anything in a separate .c/.c++ file, but care is needed --- with the GCC option -funit-at-a-time the compilation unit might span multiple files). This requirement is not specific to the GCC atomic intrinsics, so it shouldn't generally be encountered while porting.

Porting Issues

One problem which arises when porting existing atomics code is that the context in which the code is used is not always clear, and as a consequence the memory barrier requirements may not be obvious. For example, if your package has a function which performs an atomic compare-and-swap, its memory barrier requirements depend on how the function is used.

  • If the function is used by the caller to obtain exclusive access to something (i.e., try to acquire a lock), a barrier is needed at the end.
  • If the function is used by the caller to release exclusive access previously obtained (i.e., try to release a lock), a barrier is needed at the beginning.
  • If the function is used for both purposes, or if you're not 100% sure exactly how the function is used, both barriers should be included.

In general, if the function uses the atomic intrinsics, you do not need to add any explicit barriers. If unsure, or if you suspect portability problems, you can add explicit barriers using __sync_synchronize().
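
As a minimal sketch of the conservative case (both barriers included), a compare-and-swap wrapper might look something like the following. The name cas_with_barriers is hypothetical. The explicit __sync_synchronize() calls are redundant wherever the intrinsic already implies a full barrier; they are shown only to mark where the barriers belong:

```c
#include <stdbool.h>

/* Hypothetical helper: atomically replace *p with new_val if *p == old_val. */
static inline bool cas_with_barriers(int *p, int old_val, int new_val)
{
    bool swapped;

    __sync_synchronize();   /* barrier at the beginning: covers the "release" use */
    swapped = __sync_bool_compare_and_swap(p, old_val, new_val);
    __sync_synchronize();   /* barrier at the end: covers the "acquire" use */

    return swapped;
}
```

If the usage turns out to be acquire-only or release-only, one of the two explicit barriers can be dropped as described in the list above.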


Here's how a simple non-recursive mutex might be implemented using the GCC primitives:

#include <stdbool.h>

typedef int mutex;

static inline bool mutex_try_lock(mutex *m)
{
    return __sync_lock_test_and_set(m, 1) == 0;
}

static inline void mutex_unlock(mutex *m)
{
    __sync_lock_release(m);
}
Operand combinations

Thumb-2 is a bit more restricted with regard to instruction operands (though in some cases, Thumb-2 is actually more flexible than ARM). You may therefore find that you occasionally get assembler errors when assembling existing code for Thumb-2.

To work around this, you may need to try the following approaches:

  • For "branch out of range" type errors, use a different branch type, rearrange the code, insert a trampoline somewhere etc.
  • For "index out of range" type errors in load and store operations, you may need to manually add (all or some of) the desired index offset to the base register in an explicit instruction before the load or store.

Use of PC and SP


  • PC may also be denoted by "r15" in assembler source.
  • SP may also be denoted by "r13" in assembler source.

Generally, doing fancy stuff with the program counter and stack pointer registers is deprecated in ARMv7, and may not be allowed in Thumb-2 at all. Existing code which does such things will generally need some porting (though it is generally safe not to worry if you don't get errors or warnings when building).

As a general rule, do check for assembler warning messages (not just errors) when building the affected code.

If it sounds from the register's name like it may not have been designed for what you're doing, you are probably doing "fancy stuff" and should avoid it... ;)

Future versions of the ARM architecture might reuse the corresponding instruction encodings to do something completely different, causing mysterious forwards-compatibility errors.

For the definitive reference, you should consult the ARM Architecture Reference Manual.

Brief details:

Stack Pointer (SP) register

For SP, you can push, pop, ldmfd sp!,..., stmfd sp!,... or add or sub or mov, but you should generally avoid doing anything else. Do not use SP as a destination register in other operations, or attempt to push or pop sp itself from the stack. Older assemblers may not accept push and pop in ARM code; this may be (but probably is not) an issue for Debian compatibility.

Using SP as a base register for load and store operations is allowed, and you may add an offset to the address, but you may not multiply/shift/scale SP.

Upwards-growing stacks (ldmea sp!, ..., stmea sp!, ... etc.) are deprecated, but it is rare to encounter these.

Program Counter (PC) register

For PC, you cannot use it as the destination register in most operations, except for mov, ldr, and pop {..., pc} or ldmfd sp!, {..., pc}.

Using the PC as a source register in simple operations (add <reg>, pc, ..., sub <reg>, pc, ..., or mov <reg>, pc) is allowed but may produce different results in Thumb compared with ARM, and non-Thumb-aware code which does these things will need to be ported. In particular, attempts to manually determine a return address (mov lr, pc or similar) or index inline jump tables (ldr pc, [pc, <index>]) or similar may need attention.

Using the PC as a base address register (i.e., the first operand inside the brackets [pc ...]) is allowed in simple load and store operations, but you may not multiply/shift/scale or auto-update the PC (! or [pc],<index> syntax). Again, the results may be different between ARM and Thumb.

See "PC arithmetic and position-independent addressing" for more detail on this and how to handle these cases.

You should generally not push PC onto the stack or store it via push, str, stmfd etc.; some of these operations may not be allowed in Thumb-2, and results may differ between ARM and Thumb. However, loading PC from the stack is explicitly allowed as one of the "correct" ways of doing a procedure return - see "Procedure calls and returns".

RSC instruction (Reverse Subtract with Carry)

The RSC instruction is not available in Thumb-2. To work around this, either:


  • re-code using instructions such as SBC; or

  • build the affected code as ARM.

For example, the following ARM code performs integer negation on a 64-bit operand held in two registers:

        RSBS    r0, r0, #0
        RSC     r1, r1, #0

The same task can be performed using Thumb-2 compatible code such as:

        RSBS    r0, r0, #0
        MVN     r1, r1          @ r1 := NOT r1 (flags are not changed)
        ADC     r1, r1, #0      @ add the carry from RSBS (set only if the low word was 0)

        @ result is in r1:r0

or:

        RSBS    r0, r0, #0
        SBC{S}  r3, r3, r3      @ r3 := -1 if there was a borrow in the RSBS instruction (carry flag clear), 0 otherwise
                                @ The initial value of r3 does not affect the result.
        SUB{S}  r3, r3, r1

        @ result is in r3:r0

The second example can produce more compact code because 16-bit encodings are available for all these instructions if the flag-setting variants (optional S, indicated by {S}) are used; this comes at the expense of an increased register footprint and false dependencies on the condition flags and the initial value of r3. Note that false dependencies can affect performance in some circumstances.

In inline assembler, changes to the registers in which the output appears usually don't matter, provided proper asm constraints are written so that the compiler can make sure everything ends up in the right place; this consideration is not ARM-specific.
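
As a rough illustration of the point about constraints, the second sequence might be wrapped in GCC inline assembler as follows. This is only a sketch: neg64 is a hypothetical name, the assembler path assumes a 32-bit ARM target (ARM or Thumb-2), and the plain C fallback exists only so that the code still builds on other targets. The scratch register is declared as an early-clobber output, so the compiler is free to choose where it lives:

```c
#include <stdint.h>

/* Hypothetical helper: negate a 64-bit value held as two 32-bit halves. */
static inline uint64_t neg64(uint64_t x)
{
#if defined(__arm__)
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    uint32_t tmp;                       /* scratch; initial value is irrelevant */

    __asm__ ("rsbs   %0, %0, #0\n\t"    /* lo := 0 - lo, setting the carry flag */
             "sbc    %2, %2, %2\n\t"    /* tmp := -1 on borrow, 0 otherwise */
             "sub    %1, %2, %1"        /* hi := tmp - hi */
             : "+r" (lo), "+r" (hi), "=&r" (tmp)
             :
             : "cc");                   /* the condition flags are modified */

    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)0 - x;             /* portable fallback for non-ARM builds */
#endif
}
```

Because the operands are expressed as constraints rather than fixed register names, it does not matter that the result ends up in a different register from the traditional RSC version.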

Types of Assembly Language

In order to make correct porting decisions it's important to understand what kind of assembler you're looking at, and the instruction set it will be assembled for. The primary reason for this is to understand the requirements for interworking (function calls between ARM and Thumb code) to work properly.

There are a few possibilities here:

Traditional ARM assembler

  • out-of-line assembler files (.s, .S)

  • the following directives are not present in the source: .code 16, .thumb, .thumb_func, .syntax unified

  • 3- or 4-operand instructions present (e.g., add r0, r1, #3) and arbitrary conditional instructions (e.g., subgt r1, r4, r5)

  • assembles to fixed-size 32-bit instructions

Unified assembler

  • out-of-line assembler files (.s, .S)

  • the following directive is present in the source: .syntax unified

  • code looks similar to traditional ARM assembler
  • hashes (#) in front of immediate operands are not required and may be absent

  • except for conditional branches, conditional instruction sequences must be preceded immediately by it directives (such as itte eq / moveq r0, r1 / subeq r2, 1 / movne r0, 0)

  • assembles either to fixed-size 32-bit (ARM) instructions, or mixed-size (16-/32-bit) Thumb-2 instructions, depending on the presence of .code, .thumb, .arm directives etc.

"Hybrid" assembler

  • all GCC inline assembler (.c, .h, .cpp, .cxx, .c++ and so on)

  • code intended to build for Thumb-2 (lucid default) or ARM, depending on GCC configuration and command-line switches (-marm, -mthumb)

  • code should be understandable as traditional ARM assembler and unified assembler

  • it blocks may be present (but are inferred otherwise --- may be better to omit them unless it is critical for performance, for compatibility with older Debian tools etc.)

  • hashes (#) in front of immediate integer constants should be present (for compatibility with the older assembler syntax).

  • assembles either to fixed-size 32-bit (ARM) instructions, or mixed-size (16-/32-bit) Thumb-2 instructions (lucid default), depending on the GCC configuration and command-line options (-marm, -mthumb).

Traditional Thumb-1 assembler (rare)

  • You probably won't see any of this!
  • out-of-line assembler files (.s, .S)

  • any of the following directives are present in the source: .code 16, .thumb, .thumb_func

  • the following directive is not present in the source: .syntax unified

  • mostly 2-operand instructions (add or sub can sometimes have 3)

  • all instructions are unconditional except for branches (beq etc.)

Conditional Execution

In the ARM instruction set, most instructions have a condition code as part of their encoding and can be executed predicated on arbitrary conditions. For example, ARM programmers will be familiar with techniques such as:

        CMP     r1, #0x41
        MOVEQ   r0, #1
        MOVNE   r0, #0

        /* r0 == 1 if r1 == 0x41; r0 == 0 otherwise */

The Gory Details

In Thumb-2, most instructions do not have a built-in condition code (except for conditional branches). Instead, short sequences of instructions which are to be executed conditionally can be preceded by a special "IT instruction" which describes the condition, and which of the following instructions should be executed if the condition is true or false respectively. Up to four instructions can be predicated in this way.

The IT instruction is similar to an "if-then-else" construct in high-level languages:

        IT      EQ      /* if the EQ condition is true, (T)hen execute the next instruction */
        MOVEQ   r0, #1  /* note the redundant condition code, which needs to match the IT instruction */

or (matching the ARM example above)

        ITE     EQ      /* if the EQ condition is true, (T)hen execute the next instruction (E)lse execute the instruction after it */
        MOVEQ   r0, #1
        MOVNE   r0, #0  /* note that the condition code must be inverted, since this is an "else" instruction */

Up to 4 instructions can be predicated, and the "if" and "else" instructions can be interleaved, e.g.:

        ITETT   EQ
        MOVEQ   r0, #1
        MOVNE   r0, #0
        MOVEQ   r1, #0
        MOVEQ   r2, #0

Finally, you should not branch out of an IT block except right at the end of the block --- otherwise you get undefined behaviour. It's also doubtless a bad idea to branch into the middle of an IT block from outside.

For compatibility with both ARM and Thumb, the IT block construct is always understood when using unified assembler syntax, regardless of whether you assemble for ARM or Thumb-2. However, for some reason as actually demands by default that the IT blocks are present, which means you get assembler errors if using unified syntax (e.g. to target Thumb-2) unless a command-line option is specified to override this (see the following paragraphs).

The assembler checks the IT block for consistency; then, if the code is assembled for ARM, the IT instruction is discarded. Otherwise, if the code is assembled for Thumb-2, the conditions on the predicated instructions are discarded. This way, the resulting code behaves the same way in both cases.

The Easier Way

Because manually adding all the IT instructions is a bit painful, the assembler can guess them. To turn on this feature, specify the assembler command-line option -mimplicit-it=<when>, where <when> can be never, arm, thumb or always. In addition, the checking is only done when using unified assembler syntax. For backwards compatibility, it generally makes sense to turn on unified assembler syntax only when assembling for Thumb-2, in which case -mimplicit-it=thumb is a sensible choice. (Note that the assembler is clever enough to split blocks containing a branch when guessing implicit IT blocks, so you don't need to worry about this yourself.)

In lucid, explicit IT blocks are not needed in GCC inline assembler, because GCC uses unified assembler syntax only when compiling for Thumb-2, and passes -mimplicit-it=thumb to the assembler during high-level language compilation.

Building Stand-Alone Assembler Files in Thumb-2

The GCC default settings do not apply to stand-alone assembler files (.s, .S, .asm, .inc), whether or not they are passed through the compiler.

Enabling Thumb-2

Instead, as always defaults to assembling in ARM for assembler files, using the traditional syntax.

Note: For historical reasons as does support a -mthumb command-line option, but it doesn't do what you want: this enables the older Thumb-1, which has its own, incompatible syntax. To produce Thumb-2 code you still need to turn on unified assembler syntax, which can only be done using the .syntax unified directive in the assembler source. To keep everything in one place and avoid confusion, it's better to control the assembler output mode with directives rather than attempting to control it on the command line.

If you want to build pre-existing stand-alone assembler files as Thumb-2, you need to do the following:

  • Port the code as required to be Thumb-2 compatible (as documented on this page).
  • Conditionally add the following directives, when building the package for Thumb-2:

.syntax unified /* use unified assembler syntax */
.code 16        /* assemble in Thumb-2 (".thumb" can also be used) */
  • For compatibility with older tools, these directives should not be included when building for ARM.
  • Add the following command-line option to the assembler: -mimplicit-it=thumb

  • Generally you should not add IT directives to the source by hand, since this will be incompatible with older tools. You might want to add them later when optimising performance, but they will need to be conditionally assembled.

  • Annotate every global function as a code symbol, using one of the following methods:

    • use the directive .thumb_func immediately before the label starting the function (must be assembled conditionally, only if building for Thumb); or

    • use the directive .type <label>, %function.

    • Do not annotate any data labels in this way.
    • Note that calls and jumps to another location in the assembler source file, if present, need careful handling. If the destinations of such jumps (including local symbols) are annotated as above, you must generally use Thumb-2 compatible call and return sequences. (bl <label> is usually the correct way to call.) If the destinations are not annotated (numbered local symbols such as 0: can never be annotated), the Thumb-2 compatible call and return sequences such as bx may not work correctly; it is sometimes safe to leave the traditional mov pc sequences intact, for local calls and jumps only. See "Procedure Calls and Returns" for more discussion of the relevant issues here.

Detecting the Architecture Version

As when porting inline assembler in C, various code snippets may need to be conditionally assembled depending on the target architecture.

For assembler which is passed through the C preprocessor, you can use the C predefined macros and #ifdefs for this purpose.
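
As a small sketch of the kind of test involved, the following chain distinguishes the instruction set being targeted, using the GCC predefined macros __thumb2__, __thumb__ and __arm__. It is written as a C function here (target_instruction_set is a hypothetical name) so that it can be tried out directly, but the same #if / #elif structure can be used in a preprocessed .S file to select between code snippets:

```c
/* Hypothetical helper: describe the instruction set GCC is compiling for. */
static const char *target_instruction_set(void)
{
#if defined(__thumb2__)
    return "Thumb-2";       /* Thumb with the Thumb-2 extensions */
#elif defined(__thumb__)
    return "Thumb-1";       /* Thumb without Thumb-2 */
#elif defined(__arm__)
    return "ARM";           /* fixed-size 32-bit ARM instructions */
#else
    return "not ARM";       /* some other architecture entirely */
#endif
}
```

The order of the tests matters: __thumb__ is also defined when building for Thumb-2, so __thumb2__ has to be checked first.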

For stand-alone assembler files some other technique must be employed, such as:

  • Define conditional assembler macros (or symbols for use with .if) in include files; package configuration drives selection of the include files (by examining GCC's predefined macros when invoked with CFLAGS for example).

  • Predefine symbols (use the --defsym command-line option in ASFLAGS).

  • Use an additional script or macro processor (I've seen m4 used) to preprocess the assembler source.

An unpleasant, but simple hack would be to generate an assembler include file from the GCC predefined macros using a technique similar to this:

gcc -dM -E -x c /dev/null | grep -i 'arm\|thumb\|vfp\|neon' | sed -ne 's/^#define *\([^ ]\+\) *1 *$/.equ \1, 1/p'

This is at the user's own risk, of course. This assumes that C macro names are always valid assembler symbols (which is the case for ARM).

Checking for Correct Definition of Thumb-2 Function Symbols

To check whether function symbols will interwork properly in code compiled/assembled as Thumb, you can use readelf.

Note: you cannot use objdump for this test. It helpfully masks off the bottom bit of each address when listing the symbol table, hiding the information that you need to see.

For example:

$ readelf -s thumb2.o

   Num:    Value  Size Type    Bind   Vis      Ndx Name
     5: 00000002     0 NOTYPE  LOCAL  DEFAULT    1 g
     7: 00000001     0 FUNC    GLOBAL DEFAULT    1 f

Here, f is appropriately defined, but g is not.

A Thumb-2 code symbol is properly marked for interworking calls only if both of the following are true:

  • The symbol type is function symbol (FUNC)

  • The symbol value has the bottom bit set (i.e., is an odd number).

.type <symbol>, %function is usually the best way to achieve this in the assembler source.
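
For anyone wanting to automate this check, the two conditions can be sketched in C roughly as follows. This assumes the <elf.h> header as shipped with glibc on Linux; thumb_symbol_interworks is a hypothetical name, and reading the symbol table out of the object file is left out:

```c
#include <elf.h>
#include <stdbool.h>

/* Hypothetical helper: true if a 32-bit ELF symbol table entry is marked
 * correctly for interworking calls to Thumb code. */
static bool thumb_symbol_interworks(const Elf32_Sym *sym)
{
    bool is_function = ELF32_ST_TYPE(sym->st_info) == STT_FUNC;
    bool bottom_bit_set = (sym->st_value & 1u) != 0;

    return is_function && bottom_bit_set;
}
```

Applied to the readelf listing above, this would accept f (FUNC, value 00000001) and reject g (NOTYPE, value 00000002).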