S3

Power Management S3 Tricks and Tips

This document explains how S3 (suspend/resume) works and offers some tricks and tips to help debug and diagnose S3 issues.

S3 is the ACPI sleep state known as "sleep" or "suspend to RAM". It turns off power to most of the system but keeps memory powered.

Overview of S3 Suspend

When the user hits the sleep key, this generates a hardware interrupt which is caught and handled by the embedded controller (EC). ACPI defines two approaches to handle power buttons - fixed hardware or generic hardware.

In the fixed hardware scheme, the EC drives a low pulse on the PWRBTN# pin and the Southbridge (SB) sets the PWRBTN_STS bit in the PM1_STS register to tell the operating system that a power button event has occurred. In the generic hardware scheme, the EC changes a GPIO pin to generate a general-purpose event (GPE) to the BIOS, and the BIOS issues a Notify(PWRB, 0x80) to the operating system. You can find out which GPE is triggered by examining the files in /sys/firmware/acpi/interrupts and seeing which counter increments when the event occurs.
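The counter check described above can be scripted. The following is a minimal sketch, assuming the first field of each file under /sys/firmware/acpi/interrupts is the event count; the function names gpe_snapshot and gpe_changed are made up for this example.

```shell
# gpe_snapshot [DIR]: print "name count" for every counter file in DIR
# (default /sys/firmware/acpi/interrupts). Assumes field 1 is the count.
gpe_snapshot() {
    dir=${1:-/sys/firmware/acpi/interrupts}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        awk -v name="${f##*/}" 'NR == 1 { print name, $1 }' "$f"
    done
}

# gpe_changed BEFORE AFTER: compare two snapshot files and print the
# counters whose value differs, as "name old -> new".
gpe_changed() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && before[$1] != $2 { print $1, before[$1], "->", $2 }' "$1" "$2"
}

# Typical use:
#   gpe_snapshot > /tmp/gpe.before
#   (press the sleep key)
#   gpe_snapshot > /tmp/gpe.after
#   gpe_changed /tmp/gpe.before /tmp/gpe.after
```

A counter named something like gpe1E in the output would point at GPE 0x1e.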

The kernel handles the GPE and, based on which GPE was triggered, executes the appropriate handler in the ACPI DSDT. (The DSDT, or Differentiated System Description Table, contains AML bytecode that is executed in kernel context by the kernel's ACPI driver.)

For example, if GPE 0x1e is triggered then the AML bytecode method \_GPE._L1E() or \_GPE._E1E() is executed, depending on whether it is Level or Edge triggered (the L or E prefix in the method name corresponds to Level or Edge triggered events). Typically the method just issues a Notify() event, which is passed to user space via the /proc/acpi/event interface and handled by acpid. Ultimately this calls /etc/acpi/sleep.sh, which in turn calls pm-suspend.

Note: Whether a GPE is level-triggered or edge-triggered depends on the design and configuration of the SB, and the firmware (BIOS) must implement it correctly.
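The naming rule for GPE handler methods can be illustrated with a tiny helper. This is only a sketch; gpe_methods is a hypothetical name, and a given DSDT will normally contain only one of the two methods it prints.

```shell
# gpe_methods NUM: print the level- and edge-triggered handler names for
# GPE number NUM (decimal or 0x-prefixed hex), following the _Lxx/_Exx rule.
gpe_methods() {
    printf '\\_GPE._L%02X \\_GPE._E%02X\n' "$(($1))" "$(($1))"
}

# Typical use:
#   gpe_methods 0x1e
```

The printed names can then be searched for in a disassembled DSDT to see what the handler actually does.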

The pm-suspend script prepares the machine for suspend. Typically this involves unloading broken modules that don't suspend well, and then finally writing the text "mem" to /sys/power/state.

Writing to /sys/power/state initiates the kernel side of the suspend. First it suspends all user space tasks (freezing them). Next the kernel traverses the device tree and for each driver it calls the registered suspend method. It is the responsibility of each driver to ensure the correct state is saved to allow the hardware to resume correctly and this is where a lot of bugs can occur.
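Before writing "mem" by hand it can be worth confirming that the kernel offers that state at all. The sketch below assumes /sys/power/state holds a space-separated list of supported states; supports_sleep_state is a name invented for this example.

```shell
# supports_sleep_state STATE [FILE]: succeed if STATE appears in FILE
# (default /sys/power/state), e.g. a file containing "standby mem disk".
supports_sleep_state() {
    state=$1
    file=${2:-/sys/power/state}
    [ -r "$file" ] || return 1
    for s in $(cat "$file"); do
        [ "$s" = "$state" ] && return 0
    done
    return 1
}

# Typical use (as root), bypassing pm-suspend entirely:
#   supports_sleep_state mem && echo mem > /sys/power/state
```

Writing to /sys/power/state directly skips the pm-suspend module handling, which helps separate kernel problems from script problems.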

The kernel then executes the ACPI methods _PTS() (Prepare To Sleep) and/or _GTS() (Going To Sleep), which are again in the DSDT. This byte code generally does platform specific magic, such as writing magic values to the embedded controller and even calling System Management Mode (SMM) code in the BIOS via System Management Interrupts (SMIs). The kernel has no knowledge of what is happening while these methods are executing - the byte code controlling the BIOS and Embedded Controller interactions is out of the kernel's control. If the AML byte code or the BIOS code executed in the SMI is buggy it can cause S3 issues, and there is little that can be done to easily fix this in the kernel.

When the methods return control to the kernel the system is almost ready to suspend. The kernel writes the address of its resume wakeup code into the Firmware Waking Vector (held in the FACS, which is located via the Fixed ACPI Description Table (FADT)). The kernel fetches the addresses of the PM1a/PM1b control registers from the FADT and writes the sleep type (SLP_TYP) and sleep enable (SLP_EN) bits to these registers, which triggers the sleep. The kernel then sits in a wait loop until the write to the PM control registers takes effect and puts the machine to sleep. After the kernel writes SLP_TYP and SLP_EN, the SB commonly generates an SMI (called the Sleep SMI, as it is invoked after the kernel considers that the system is in S3).
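The register write described above can be worked through as arithmetic. This sketch assumes the PM1 control register layout from the ACPI specification (SLP_TYPx in bits 10-12, SLP_EN in bit 13); the function name is hypothetical and the SLP_TYP value 5 used in the usage line is only a placeholder, since the real value comes from the \_S3 package in the DSDT.

```shell
# pm1_sleep_value CURRENT SLP_TYP: given the current PM1 control value and
# the platform's SLP_TYP code, print the value that requests the sleep.
# The old SLP_TYP and SLP_EN bits are cleared first; other bits are kept.
pm1_sleep_value() {
    cur=$(($1)); typ=$(($2))
    printf '0x%04x\n' $(( (cur & ~(7 << 10) & ~(1 << 13)) | ((typ & 7) << 10) | (1 << 13) ))
}

# Typical use:
#   pm1_sleep_value 0x0001 5
```

Comparing such a value with what a chipset data sheet expects is one way to sanity check the SLP_TYP reported by the firmware.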

Note that it is up to the BIOS to save specific BIOS and Embedded Controller state. This is normally done in the _PTS/_GTS method calls and in the Sleep SMI. Sometimes this is where weird BIOS or EC state issues cause suspend/resume to fail.

A breakdown of the calls in the kernel is as follows:

state_store(): 
  handles write of "mem" to  /sys/power/state
  calls enter_state() to enter the S3 state

    enter_state():
      syncs filesystem
      calls suspend_prepare():

        suspend_prepare():
          prepares console for suspend
          freezes processes
          calls suspend_devices_and_enter():

            suspend_devices_and_enter():
              suspends devices
              suspends console
              calls suspend_enter():

                suspend_enter():
                  prepares all devices for power down
                  disables all CPUs apart from boot CPU
                  disables interrupts
                  calls acpi_suspend_enter()

                    acpi_suspend_enter():
                      flushes CPU cache
                      calls acpi_save_state_mem()

                        acpi_save_state_mem():
                          sets wakeup_header for resume
                          32 bit: return via wakeup_pmode_return
                          64 bit: trampoline via setup_trampoline()
                          calls do_suspend_lowlevel() (assembler)

                            do_suspend_lowlevel():
                              save processor state
                              save registers
                              call acpi_enter_sleep_state()
                         
                                acpi_enter_sleep_state():
                                  disables all GPEs
                                  enables all wakeup GPEs
                                  executes the _GTS() ACPI Method (if it exists)
                                  sets SLP_TYP in PM1a and PM1b registers
                                  flushes CPU caches
                                  sets SLP_TYP + SLP_EN in PM1a and PM1b registers
                                  idles for a while; eventually the PM1a/1b register
                                  write takes effect and the machine goes into S3

Overview of Resume from S3

Eventually the user wants to wake the machine up from S3 suspend. The user presses the power button, which wakes up the CPU, and it jumps to a known BIOS start address. The BIOS does some setup to bring up the memory controller and restore some device state, and then reads the ACPI status register to determine whether the machine was in a suspended state. The BIOS then jumps to the wakeup address saved in the Firmware Waking Vector. The kernel resume code pointed to by the Firmware Waking Vector is real mode x86 code that puts the CPU back into normal kernel protected mode, restores the CPU register state and pops back up the suspend call chain to do the resume. At this point the kernel calls the ACPI _WAK() method, then resumes drivers, unfreezes threads and user space processes, and returns to the point where "mem" was written to /sys/power/state. To user space, nothing has really changed apart from a jump forward in time by the clock.

Resume from S3 in Detail

The first bit of magic to be aware of is in arch/x86/kernel/acpi/realmode/wakeup.S

wakeup_header:
video_mode:     .short  0       /* Video mode number */
pmode_return:   .byte   0x66, 0xea      /* ljmpl */
                .long   0       /* offset goes here */
                .short  __KERNEL_CS
                ...

This maps directly to the wakeup_header defined in arch/x86/kernel/acpi/realmode/wakeup.h:

struct wakeup_header {
        u16 video_mode;         /* Video mode number */
        u16 _jmp1;              /* ljmpl opcode, 32-bit only */
        u32 pmode_entry;        /* Protected mode resume point, 32-bit only */
        u16 _jmp2;              /* CS value, 32-bit only */
        ...
}

During suspend, acpi_save_state_mem() sets header->pmode_entry to wakeup_pmode_return() and saves kernel state into the wakeup header. The header is part of a real mode copy (at acpi_realmode) of the chunk of code and data in wakeup_code_start..wakeup_code_end. acpi_wakeup_address points to this real mode code + data, and acpi_sleep_prepare() uses it to set the waking vector before suspending.

To understand resume, we need to examine the final phase of suspend. The assembler function do_suspend_lowlevel in arch/x86/kernel/acpi/wakeup_32.S performs this final phase. It does the following:

  • saves cpu state + registers
  • calls acpi_enter_sleep_state(), which ultimately ends up suspending the machine, and the CPU halts. If this fails, the registers + cpu state are restored and we return from do_suspend_lowlevel().

The wakeup from S3 works as follows:

  • BIOS (realmode) inspects the ACPI waking vector and jumps out of BIOS real mode into wakeup_code() in arch/x86/kernel/acpi/realmode/wakeup.S. Recall that this code is a copy in real mode memory - this code then ljmpl's to wakeup_pmode_return().
  • wakeup_pmode_return() in arch/x86/kernel/acpi/wakeup_32.S restores registers, the gdt, idt and ldt (hence comes out of real mode), restores the stack pointer, and does a final sanity check to see if a saved magic value is as expected. If all is OK it returns back to pop the stack and return from do_suspend_lowlevel() as if nothing has happened.

Fixing S3 Issues

As you can now see, suspending and resuming is rather non-trivial. The following sections explain where to look for S3 bugs:

Drivers

We rely on drivers behaving correctly in their suspend/resume methods. If state is not saved/restored correctly then broken drivers can fail in subtle and mysterious ways. Drivers may oops during suspend or resume, which makes debugging a headache since these oopses normally happen while the console is suspended. The first thing to try is to boot the kernel with the kernel parameter no_console_suspend. Next switch to VT1 using Ctrl-Alt-F1 or:

sudo chvt 1

Log in on VT1 and suspend using:

sudo pm-suspend

Repeat several times to see if you can capture a kernel oops. Use a digital camera to photograph the oops message or, if it scrolls off the screen too quickly, limit the kernel oops message by hacking dump_stack() (arch/x86/kernel/dumpstack.c) to dump out less of a stack trace.
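If no oops ever appears on VT1, it is worth confirming that the no_console_suspend parameter actually reached the kernel. A minimal sketch, assuming /proc/cmdline holds the space-separated boot parameters; has_kernel_param is a hypothetical helper name.

```shell
# has_kernel_param NAME [FILE]: succeed if NAME (bare or as NAME=value)
# is present in FILE (default /proc/cmdline).
has_kernel_param() {
    param=$1
    for p in $(cat "${2:-/proc/cmdline}"); do
        case "$p" in
            "$param"|"$param"=*) return 0 ;;
        esac
    done
    return 1
}

# Typical use:
#   has_kernel_param no_console_suspend || echo "parameter not set"
```

The same check works for any other boot parameter being tested while debugging suspend/resume.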

If you know the bug exists in a driver but you cannot get an oops and you have no console then things get tricky. The PC can only save state in the real time clock (RTC) across reboots, so we have to resort to saving a hashed state of the device suspend details in the RTC. We use the /sys/power/pm_trace interface to enable PM debugging as follows:

sudo sh -c "sync; echo 1 > /sys/power/pm_trace; pm-suspend"

If resume fails, quickly reboot the machine as you have ~3 minutes before the updates to the RTC corrupt the saved hashed state.

Once rebooted, use dmesg to look for the "Magic number:" text in the kernel log. You should see something like:

[   11.323206]   Magic number: 0:798:264
[   11.323257]   hash matches drivers/base/power/resume.c:46

You may even get lucky and get a device being mentioned as the problematic driver, e.g.:

[    11.327271]  hash matches device i2c-9191

Device numbers are sometimes shown; use lspci to track down the problematic device. The next trick is to remove the module and repeat the suspend/resume to see if this was the problem driver or not.
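Picking the pm_trace lines out of a long kernel log can be scripted. This sketch assumes the log lines look like the examples shown above, with an optional bracketed timestamp; pm_trace_matches is a made-up name.

```shell
# pm_trace_matches [FILE...]: print the "Magic number:" and "hash matches"
# lines from a kernel log with any leading [timestamp] removed.
# Reads standard input when no FILE is given.
pm_trace_matches() {
    awk '/Magic number:|hash matches/ {
             sub(/^\[[ 0-9.]*\][ \t]*/, "")
             print
         }' "$@"
}

# Typical use:
#   dmesg | pm_trace_matches
```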

If a driver is not suspending/resuming correctly, one workaround is to add the name of the module to the MODULES="" list in /etc/default/acpi-support. However, the best approach is to find out why the driver is breaking suspend/resume and fix it.

Note that /sys/power/pm_trace is known to be a little temperamental and may yield false positives. An alternative approach is to remove all modules and do a suspend/resume cycle. If this works, then start loading modules one by one until you find the one that causes the hang.
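The one-by-one approach is easier if the module list is saved before anything is unloaded. The sketch below only prints a plan and runs nothing; it assumes the first field of each /proc/modules line is the module name, and both function names are hypothetical.

```shell
# loaded_modules [FILE]: list module names from FILE (default /proc/modules).
loaded_modules() {
    awk '{ print $1 }' "${1:-/proc/modules}"
}

# bisect_plan [FILE]: print (but do not run) one load-and-suspend step per
# module, to be worked through by hand after all modules were removed.
bisect_plan() {
    loaded_modules "$1" | while read -r m; do
        printf 'modprobe %s && pm-suspend   # did resume work with %s loaded?\n' "$m" "$m"
    done
}

# Typical use, before unloading anything:
#   cp /proc/modules /root/modules.saved
#   bisect_plan /root/modules.saved
```

Printing the commands rather than executing them keeps the tester in control, since a hang part way through would otherwise lose track of which module was being tried.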

BIOS + ACPI

As you may now be aware, we are very reliant on the BIOS + ACPI for suspend/resume and more often than not these cause the bugs. The issues fall into several categories:

BIOS hangs

In this scenario, the kernel passes control over to the BIOS, and the BIOS never returns control back to the kernel. This can occur when the ACPI _PTS() or _GTS() methods call into the BIOS via SMIs - such hangs are rare, but possible. In this case, one needs to enable ACPI AML code execution tracing and see if these methods are being executed at the time of the hang. One may also want to check whether the system hangs after the kernel writes SLP_TYP and SLP_EN. If so, the BIOS may be hanging in the Sleep SMI.

More likely, though, the BIOS either never jumps back to the Firmware Waking Vector, jumps to the wrong location, or has corrupted the processor state so badly that it returns to the kernel via the Firmware Waking Vector but the kernel does not execute correctly. At this point one should sanity check whether the BIOS actually made it back into the kernel. One of a handful of tricks can be used:

1. Write some code in wakeup_pmode_return() (arch/x86/kernel/acpi/wakeup_32.S) to flash the keyboard LEDs to indicate that the BIOS jumped back to the kernel. Example code below:

#define I8042_DATA_REG          $0x60
#define I8042_STATUS_REG        $0x64

       /* Flash 16 times */
       movl    $0x10, %ebx
flashy:
       /* Turn on LEDs, don't trust stack at this point */
1:     inb     I8042_STATUS_REG, %al
       testb   $0x02, %al
       jne     1b
       movb    $0xed, %al
       outb    %al, I8042_DATA_REG

       movl    $0x1000, %eax
1:     subl    $1, %eax
       cmpl    $0, %eax
       jne     1b

1:     inb     I8042_STATUS_REG, %al
       testb   $0x02, %al
       jne     1b
       
       movl    $0x1000, %eax
1:     subl    $1, %eax
       cmpl    $0, %eax
       jne     1b

       /* LEDs ON */
       movb    $0x07, %al
       outb    %al, I8042_DATA_REG

       /* delay */
       movl    $0x07878787, %eax
1:     subl    $1, %eax
       cmpl    $0, %eax
       jne     1b

       sub     $1, %ebx
       cmpl    $0, %ebx
       jne     flashy

Unfortunately a lot of the newer machines don't even have the luxury of a keyboard LED, so you may need to try the following strategies:

2. Write some code in wakeup_pmode_return() (arch/x86/kernel/acpi/wakeup_32.S) to dump state to port $80 and use a port $80 debug card. You need to boot with io_delay=udelay or io_delay=0xed so as not to clobber port $80 on port I/O delay operations.

3. Write some code in wakeup_pmode_return() (arch/x86/kernel/acpi/wakeup_32.S) to zap the CMOS settings on resume. When the machine hangs, reboot. Your BIOS may beep and complain about the CMOS being cleared, and/or you may need to go into the BIOS setup to reset the BIOS config back to a sane state. If this happens you at least know that the BIOS transitioned from an S3 wakeup and jumped back into the kernel. Example code below:

       movl    $0x30, %ebx     /* set the whole register: the loop tests %ebx */
clrloop:
       movb    %bl, %al
       outb    %al, $0x70
       outb    %al, $0x80      /* delay */

       movl    $0x1000, %eax
1:     subl    $1, %eax
       cmpl    $0, %eax
       jne     1b

       movb    $0x00, %al
       outb    %al, $0x71
       outb    %al, $0x80      /* delay */

       movl    $0x1000, %eax
1:     subl    $1, %eax
       cmpl    $0, %eax
       jne     1b
       
       sub     $1, %ebx
       cmpl    $0, %ebx
       jne     clrloop

ACPI bugs

We are very reliant on ACPI to do things right; sometimes it just does not. A few things worth checking are:

  • Method _PTS(). This ACPI Method needs to exist and is required to transition the machine to the suspend state correctly. Unfortunately we rely on the ACPI AML code and the underlying BIOS code (if used) to do the right thing.
  • Method _GTS(). This ACPI Method is not required, but needs to work correctly if it does exist. Like _PTS() it is platform specific and may interact with the BIOS/Embedded Controller in a closed and proprietary way.
  • PM1a/b register settings from FADT. These registers are written to set the sleep type and sleep enable bits at S3 suspend time, so if the addresses are incorrect the kernel may be just twiddling the wrong registers.

These can be dumped out using the following:

  sudo acpidump > acpidump.dat
  acpixtract -sFACP acpidump.dat
  iasl -d FACP.dat

Then examine FACP.dsl and check that the PM1A/PM1B event and control blocks look sane. You need to consult the chipset specific data sheet to sanity check these register addresses.
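Pulling just the relevant lines out of the disassembly saves some scrolling. This is a sketch under an assumption: it expects the disassembler to label the fields with text such as "PM1A Control Block Address", which may differ between iasl versions, and fadt_pm1_blocks is a hypothetical name.

```shell
# fadt_pm1_blocks [FILE]: print the PM1A/PM1B event and control block
# lines from a disassembled FADT (default FACP.dsl).
# The label text matched here is an assumption about the iasl output.
fadt_pm1_blocks() {
    grep -iE 'PM1[AB] (Event|Control) Block' "${1:-FACP.dsl}"
}

# Typical use:
#   fadt_pm1_blocks FACP.dsl
```

If nothing is printed, open the file and look for the PM1 entries by eye, since the labels may simply be spelled differently.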

Suspend Hot Key

Pressing the suspend hot key should in theory trigger a suspend. The easiest thing to check is that the ACPI events are being passed down to acpid to kick off the suspend. Use the following to spy on ACPI events:

sudo acpi_listen 

If this fails, then the next thing to check is whether any GPEs are occurring. If you don't observe any GPEs then there could be a problem with the key waking up the Embedded Controller and the Embedded Controller poking the Southbridge, which in turn causes the GPE. To observe the GPEs, use:

watch -n 1 cat /sys/firmware/acpi/interrupts/*

then press the button and see if any of the GPE event counters increment. If not, you have a button + Embedded Controller issue.

Finally, if ACPI events do occur then make sure pm-suspend is being run. This is a shell script, so you can add debug output and write it to a log file to observe whether it is being called. If it is not, then there is a problem with acpid calling the sleep script which calls pm-suspend.
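One way to log without editing pm-suspend itself is a pm-utils hook, which pm-suspend runs with the phase name as its first argument. The sketch below assumes pm-utils is in use; the hook path, log path and function name are all made up for illustration.

```shell
# Body for a hypothetical hook file such as /etc/pm/sleep.d/00-s3-debug.
# log_pm_phase PHASE [LOGFILE]: append a timestamped line for a known phase.
log_pm_phase() {
    phase=$1
    log=${2:-/var/log/s3-debug.log}
    case "$phase" in
        suspend|resume|hibernate|thaw)
            printf '%s pm hook: %s\n' "$(date '+%F %T')" "$phase" >> "$log" ;;
        *)
            return 1 ;;
    esac
}

# In the hook file itself, finish with:
#   log_pm_phase "$1"
```

If the log stays empty after pressing the suspend key, pm-suspend was never invoked and the problem lies earlier in the chain.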

If pm-suspend is being called, then the final issue is most probably the kernel attempting to suspend but failing because of broken ACPI Methods such as _PTS(). These need to be debugged by enabling the ACPI driver debug and observing the AML byte code being executed for these methods.

Kernel Messages

The following power management kernel messages are a little terse and it's instructive to know what they really mean:

PM: Some devices failed to power down
  • Usually means dpm_suspend_noirq failed because some devices did not power down.

PM: Failed to prepare device
  • dpm_prepare() failed to prepare all non-sys devices for a system PM transition. The device should be listed in the error message.

PM: Some system devices failed to power down
  • sysdev_suspend failed because some system devices did not power down.

PM: can't read, PM: can't set
  • The suspend test code could not read the RTC time or could not set the RTC wake alarm.

Long delays on Resume

Machines have been known to sit in resume for 300+ seconds before finally completing resume. Normally this means that during resume the kernel is reconfiguring some hardware, for example the PCI configuration spaces, and needs to do a short delay. Sometimes the CPU can drop into a very deep C state and the HPET cannot wake it up - the timer interrupt gets lost. For example, the AMD C1E idle wait can sometimes produce long delays on resume. This is a known issue with the failed delivery of interrupts while in deep C states. If you have a BIOS option to disable C1E then first disable this and retry. Alternatively, re-test with the kernel parameter idle=mwait - this will disable the more power optimal C1E idle, so it is not Energy Star friendly.

If this is successful then you should look at more optimal workarounds such as disabling the local APIC timer or disabling the local APIC completely, so boot with the nolapic_timer or nolapic kernel parameters.

Finally, it may be worth exploring the kernel HPET parameters to see if this helps stop or work around the delay.

Kernel/Reference/S3 (last edited 2011-09-15 10:12:23 by alexhung)