KarmicBetterSuspendResume

Summary

We integrated suspend and resume apport reporting in Jaunty and received a large number of resume-failure reports. We would like to improve the suspend/resume experience in Karmic. Triaging the recently opened suspend/resume bugs, we found that most of the issues were related to graphics card drivers and ethernet/wireless drivers. In this session we intend to brainstorm ways to improve suspend/resume in Karmic.

Release Note

The changes put in place as a result of this session should improve the suspend/resume experience on Karmic.

Rationale

Users of laptops and netbooks almost never shut down and power on their devices. They prefer to close the lid and expect the machine to suspend, then open the lid to resume. This preserves the state of the applications they were running before suspend, so they can continue using those applications after resume. This remains the main use case for laptops and netbooks in spite of the fast shutdown/reboot times in Jaunty.

Unresolved issues

We have over 600 bugs opened on suspend/resume for Jaunty. Although I was able to identify some offending drivers that are common to several of the bugs, we are not able to resolve all of them.

Summary of Discussion

There was discussion of how things went in the Jaunty cycle when we enabled reporting on suspend/resume issues. There was an overwhelming number of bug reports, and we were unable to cope and respond to them fairly and in a timely fashion. This work did find several generic bugs, but we were unable to satisfactorily find common themes. Some improvements were made over the cycle, including better reporting and the potential to use the hardware DB.

It was pointed out that Rafael had written an OLS paper on the improvements coming in this area.

There were discussions of what our goals should be in the Karmic cycle. We clearly want to continue to improve the suspend/resume experience; if nothing else, our testing in Jaunty has shown we have a problem. We also have a number of improvements in the pipeline coming from KMS, which we would like to evaluate against Jaunty's results. The direct goals were:

  • no drivers needing removal
  • a review and cleanup of the pm-utils scripts
  • better understanding of the Jaunty issues via data analysis
    • duping of common problems together, etc.

A number of ideas were suggested:

  • investigate the debugging features and why they don't work
  • investigate leaving the console on longer
    • no_console_suspend: how would this affect the experience?
    • how does KMS affect this?
  • the /sys/power/pm_test file: tell it which level a suspend should go down to
  • look at some Karmic testing, but more focused this time
    • look at very bad machines, and request specific testing on those
    • we can script updating bug reports from testers with a request to re-test
    • there are many suspend/resume bug fixes available in newer kernels; the expectation is that things are better, and being able to measure that has value
  • look at how we can help normal people find out how to communicate with upstream
    • WIP for lp/bugzilla plugin for kernel bugzilla
  • are we upstreaming bugs where we are using removal of a module on suspend/resume as a solution?
    • we should be opening generic bugs for failures, with upstream bugs linked to each
    • we should be getting upstream kernel testing if possible too
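
The /sys/power/pm_test idea above can be illustrated with a small sketch. It assumes a kernel built with PM debugging support, where the file lists every test level on one line and marks the active one with square brackets. The helper names parse_pm_test and set_pm_test are hypothetical and only serve this example; writing to the file needs root.

```python
from pathlib import Path

# Only present on kernels built with PM debugging support (assumption).
PM_TEST = Path("/sys/power/pm_test")


def parse_pm_test(text):
    """Return (levels, selected) from the contents of pm_test.

    The kernel lists all test levels on one line and wraps the active
    one in square brackets, e.g. "[none] core processors ...".
    """
    levels = []
    selected = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        levels.append(token)
    return levels, selected


def set_pm_test(level, path=PM_TEST):
    """Select how far the next suspend should go; requires root."""
    levels, _ = parse_pm_test(path.read_text())
    if level not in levels:
        raise ValueError(f"unsupported pm_test level: {level!r}")
    path.write_text(level)


if __name__ == "__main__":
    if PM_TEST.exists():
        levels, selected = parse_pm_test(PM_TEST.read_text())
        print("available:", levels, "selected:", selected)
    else:
        print("pm_test is not available on this kernel")
```

A tester could, for example, select the "devices" level before suspending, so that only the device suspend and resume callbacks are exercised and a hang can be attributed to a driver rather than to the platform or firmware.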

We should also consider allowing users to opt in to reporting more information to make these reports better:

  • do you want to send data? "I don't mind":
    • success reports
      • success needs data on the environment that leads to success
    • having something in Launchpad, allowing us to do more testing with them and report success etc.

Design

  • Several design goals were discussed at UDS.
    • Not requiring removal of kernel modules.
      • One way to fix suspend/resume issues caused by badly behaved modules is to remove them before suspend and reload them after resume. There has been an objection to this method: the correct way to handle these modules is to fix them, or to get their maintainers to fix their suspend/resume behavior. However, certain modules in the experimental/staging area of the kernel cause suspend/resume issues, and removing them before suspend fixes the resume problems. These modules need to be identified and their maintainers notified of the resume issues, although some of these modules are no longer actively developed.
    • Review and cleanup of pm-utils scripts
      • Perhaps as a first step, modify the scripts so that we can isolate suspend issues from resume issues.
    • Data mining and analysis of Jaunty suspend resume bugs for clues.
    • Due to the varied nature of hardware on which these problems are seen, a wider community effort in debugging suspend/resume issues is called for.
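
Identifying which loaded modules fall into a known-problematic set, as called for in the module-removal goal above, could be sketched as follows. This is only an illustration: the function names, the sample text, and the module name example_wifi are hypothetical, and the input is assumed to be in the /proc/modules format, where the module name is the first field of each line.

```python
def loaded_modules(proc_modules_text):
    """Return module names from text in /proc/modules format."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.append(fields[0])  # first field is the module name
    return names


def flag_problem_modules(proc_modules_text, known_bad):
    """Return the loaded modules that appear in known_bad, sorted by name."""
    return sorted(set(loaded_modules(proc_modules_text)) & set(known_bad))


# Hypothetical sample; real input would come from reading /proc/modules.
SAMPLE = """\
example_wifi 123456 0 - Live 0x0000000000000000
snd_hda_intel 40000 2 - Live 0x0000000000000000
"""

if __name__ == "__main__":
    print(flag_problem_modules(SAMPLE, {"example_wifi", "example_eth"}))
```

The known_bad set would have to be maintained from triaged bug reports; nothing here implies that any particular real driver belongs in it.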

Implementation

  • Kernel Modules
    • Debugging tips & tricks

      • Review the information that is already available on the wiki and make sure it is updated with good debugging tips & tricks. The goal of the wiki is to enable those affected by suspend/resume problems to debug the problems themselves and file more complete bug reports with plenty of debug information. This will give the kernel developers who triage these bugs more to work with when fixing them.

    • Identify problem areas in offending modules and fix modules or work problems upstream
      • There are several drivers in the staging/experimental directories that are in use and are known to misbehave on suspend/resume. Perhaps a pop-up balloon should warn users, when they enable the suspend/resume feature, that an experimental driver (broken drivers, and certain wireless and networking drivers known to break the suspend/resume experience) is in use and could result in a broken suspend/resume experience.
      • Track commonly used drivers that are known to break the suspend/resume experience, and report them to their maintainers or fix them and work the patches upstream. This requires knowledge of, and access to, the hardware, which makes the task difficult for kernel maintainers. Wider community involvement is perhaps a more practical solution. At UDS, John Linville suggested that he could help look at some of the wireless drivers that need attention with respect to their behavior on suspend and resume. Make an effort to get him involved in looking at some of the commonly used drivers that cause suspend/resume failures.
    • Greater community involvement in Debugging
      • One of the hurdles in working on suspend/resume problems is the lack of access to hardware on which suspend/resume is known to be broken. There is only so much debugging and fixing that can be done otherwise. A wider community effort in fixing suspend/resume problems is called for. Use social networking to get the word out, ask the community for help in identifying such hardware, and see if they are willing to help in any way. As mentioned earlier, better wiki pages will help the community debug and fix suspend/resume issues.

    • Roll in new suspend/resume support from upstream
      • Look for patches that fix suspend/resume issues upstream, and make sure that they are integrated in the Karmic kernel. Some patches might not have made it to the Linus tree, but if they add value to our effort, port them over to the Karmic kernel and let the community test them.
    • X related suspend/resume
      • X-related suspend/resume issues are harder to fix without knowledge of the video hardware; we may need to rely on upstream to fix such issues. Tap into known upstream resources for Intel video drivers etc.
      • Help originators of bug reports communicate X-related failures better: point them at the X debugging wiki so that those reporting X-related bugs know how to debug and provide detailed debugging information on the failure.
    • Measure impact of KMS on resolving suspend/resume issues
      • A suspend/resume experience session at the platform summit will give us a good measure of how much impact KMS has on resolving suspend/resume issues. Although this covers only a small subset of hardware, the effort will be worthwhile. Also, announce to and poll the community about their experience with KMS; using Brainstorm to collect poll data might be a good idea.
    • Leaving console on for longer duration.
      • TBD; to be discussed at the platform sprint.
  • Data Mining and Analyzing suspend/resume bugs in Jaunty
    • Due to the overwhelming number of these bugs, this task can be very time- and resource-consuming.
    • Use simple methods to connect common points of failure in drivers and subsystems. A simple shell script that walks through the attachments on each bug and collects common data points might be a good start.
    • Gather information on hardware that is known to cause problems with suspend/resume.
    • Dupe together bugs that are known to be caused by a single point of failure.
  • pm-utils refresh
    • Karmic already uses the latest pm-utils package; review the code to see if any cleanup can be done. Scott might be able to identify areas where improvements can be made, especially in the Ubuntu-specific pieces of the pm-utils scripts.
    • Identify problem areas in the pm-utils scripts and propose improvements
    • Better reporting of suspend and resume points of failure
    • community involvement in review of pm-utils scripts to better handle problem hardware.
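
The data-mining step above suggests a simple script that walks the attachments on each bug and collects common data points. A minimal sketch of that idea follows, written in Python rather than shell for easier testing. It assumes the attachments have already been downloaded into one directory per bug (root/<bug id>/<file>); that layout, the function names, and the log-line pattern are all illustrative assumptions, not something the spec defines.

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative pattern only: adjust it to whatever failure lines the
# attachments actually contain.
FAILED_DEVICE = re.compile(r"PM: Device (\S+) failed to (suspend|resume)")


def data_points(texts, pattern=FAILED_DEVICE):
    """Return the distinct (device, phase) matches in one bug's attachments."""
    found = set()
    for text in texts:
        for match in pattern.finditer(text):
            found.add(match.groups())
    return found


def tally(bugs, pattern=FAILED_DEVICE):
    """Count, for each data point, how many bugs mention it.

    bugs maps a bug id to the list of its attachment texts. Each bug
    counts at most once per data point, so repeated log lines in one
    report do not inflate the totals.
    """
    counts = Counter()
    for texts in bugs.values():
        counts.update(data_points(texts, pattern))
    return counts


def load_bugs(root):
    """Read attachments from root/<bug id>/<file> (assumed layout)."""
    bugs = {}
    for bug_dir in sorted(Path(root).iterdir()):
        if bug_dir.is_dir():
            bugs[bug_dir.name] = [
                path.read_text(errors="replace")
                for path in sorted(bug_dir.iterdir())
                if path.is_file()
            ]
    return bugs
```

Calling tally(load_bugs(some_directory)).most_common() would then list the data points shared by the most bugs, which is a starting point for deciding which reports to dupe together.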

KernelTeam/Specs/KarmicBetterSuspendResume (last edited 2009-08-19 12:38:46 by kohlrabi)