SevenStepTroubleShootingProcess

Problem

From time to time you come across problems in life, whether your car won't start or some software is buggy. To figure out the issue you need to follow a process to understand the problem.

Process

This process, called "troubleshooting", is known by different names when applied to different things: diagnosing, debugging, problem solving, experimentation, etc. While the names differ, the process is the same. To codify it, I call it the Seven Step Troubleshooting Process, outlined below.

Some of the advice given in this document is specific to Ubuntu and computer software/hardware bugs; however, the general process can be applied to almost any problem in life.

Remember that this is an iterative process: you will have to follow several of the steps over and over again, with great attention to detail. The detail you miss might be the detail you needed to solve the problem quickly.

Another critical thing to remember: you only change one thing at a time, EVER, and then you retest. You can finish the process when you think you have fixed the problem, testing bears this out, and you can explain exactly what the root cause was and exactly how your change fixed the bug. Also, once you think you have fixed the bug, make sure you have removed all the changes you made to localize it, leaving only the final root cause fix.

Seven Step Troubleshooting Process

  • 1) Prepare
    2) Document Symptoms (iterative)
    3) Analyze (iterative)
    4) Change (iterative)
    5) Fix
    6) ReTest
    7) Document Resolution
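
The jumps between steps described in the sections below can be written down as a small table. This is only a sketch of one reading of this document: the names NEXT_STEPS and is_valid_path are made up for illustration, and the allowed moves are taken from the "jump to step N" instructions in each step.

```python
# Allowed moves between steps, read off the step descriptions in this document.
# NEXT_STEPS and is_valid_path are illustrative names, not part of any tool.
NEXT_STEPS = {
    "Prepare": ["Document Symptoms"],
    "Document Symptoms": ["Analyze"],
    # Analyze can ask for one more change, a verification pass, or the fix.
    "Analyze": ["Change", "Document Symptoms", "Fix"],
    "Change": ["Document Symptoms"],
    "Fix": ["ReTest"],
    # A failed retest sends you back to documenting symptoms.
    "ReTest": ["Document Resolution", "Document Symptoms"],
    "Document Resolution": [],
}


def is_valid_path(path):
    """Return True if the path starts at Prepare and uses only allowed moves."""
    if not path or path[0] != "Prepare":
        return False
    return all(nxt in NEXT_STEPS[cur] for cur, nxt in zip(path, path[1:]))
```

A path such as Prepare, Document Symptoms, Analyze, Change, Document Symptoms, Analyze, Fix, ReTest, Document Resolution passes this check, while jumping straight from Prepare to Fix does not.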

1) Prepare

  • Once you realize there is a problem, stop and step away. That's right: do nothing. Don't touch a thing. This is critical. Think about what you are seeing and what you did to get here. If you think you will need some hardware to work the issue, get it now. If necessary, take a short break so you can focus on the issue.

    Now that you have overcome your initial urge to touch things, move on to step 2.

2) Document Symptoms

This step is iterative and may be done many times.

  • First we are going to document how we arrived here, then we will document the actual symptoms. You need both pieces of data, because they interrelate.

    The first time you find a problem, document everything you remember doing to get to this point, no matter how small. It's important. Go so far as to document things you think you might have done even if you are not sure; just mark what you know is fact and what may not be. On later passes through the problem, compare and document whether any of the steps change, as that is very important data.

    Now document everything about the problem that you can see, again without touching anything. DON'T change the state of the thing you are trying to troubleshoot. THIS IS CRITICAL. Remember, the more data you gather, the quicker you will locate the root cause.

    In Ubuntu, some of the data you might need is the hardware type, the BIOS version number and date, all of the information about the machine (how much memory, etc.), the OS version, the kernel version, the packages installed on the machine, the version of the affected package, etc. In other words, everything another person would need to duplicate the problem on another box of the same type. Some of this data might need to be gathered on later passes of the data gathering process. That is OK, but do make sure the data is gathered.

    While you are iterating over the problem, there is another thing to remember once you are done documenting the latest change: if the change you made does not change the problem, remove it. Don't leave it in, only to have it add a new problem once you find and fix the root cause of this one. This happens to many folks: they forget to remove the "hacks" they added to localize the problem, and later they have new problems to fix.
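
Gathering the Ubuntu machine data listed in this step can be partly automated. The following is a minimal sketch, not an official Ubuntu tool: the function names are made up for illustration, and it assumes a Linux system that exposes BIOS data under /sys/class/dmi/id and has dpkg-query installed. Anything that cannot be read is recorded as None rather than guessed.

```python
import platform
import shutil
import subprocess
from pathlib import Path

DMI_DIR = Path("/sys/class/dmi/id")  # Linux sysfs location for BIOS/board data


def read_text_or_none(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except (OSError, UnicodeDecodeError):
        return None


def run_or_none(cmd):
    """Run a command and return its stdout, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout.strip() if result.returncode == 0 else None


def total_memory_kb():
    """Return the MemTotal value from /proc/meminfo as a string, or None."""
    meminfo = read_text_or_none("/proc/meminfo")
    if meminfo is None:
        return None
    for line in meminfo.splitlines():
        if line.startswith("MemTotal:"):
            return line.split()[1]
    return None


def gather_system_facts():
    """Collect the facts another person would need to duplicate the setup."""
    return {
        "machine": platform.machine(),
        "kernel": platform.release(),
        "os_release": read_text_or_none("/etc/os-release"),
        "product_name": read_text_or_none(DMI_DIR / "product_name"),
        "bios_version": read_text_or_none(DMI_DIR / "bios_version"),
        "bios_date": read_text_or_none(DMI_DIR / "bios_date"),
        "mem_total_kb": total_memory_kb(),
        # One "name<TAB>version" line per installed package.
        "packages": run_or_none(["dpkg-query", "-W"]),
    }
```

Saving the returned dictionary with each pass makes it easy to compare later passes against the first one.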

3) Analyze

This step is iterative and may be done many times.

  • A key part of the Seven Step Troubleshooting Process is that you only get to make one change at a time, so you analyze your data with an eye to how you can eliminate possibilities. If you make more than a single change at a time, it becomes impossible to understand what is going on when the symptoms change.

The purpose of analyzing this data is to figure out how to localize the problem. We want to devise a test that will eliminate sections of code or hardware from the problem.

  • Read the data you have gathered to this point, analyze it, and try to understand as much as you can.

    In Ubuntu, if you have not yet opened a bug, now is the time to do so; if you already have a bug open, add the gathered data to it. Include ALL of the data you have gathered. If you have an idea which package to log the bug against, log it there; if you find out later in the process that you were incorrect, you can change the package it is logged against.

    Part of the process is knowing whether the bug is repeatable or intermittent, so you will need to reboot, start again, and see if you can recreate the bug. If the bug is intermittent but can be repeated periodically, record how often you can recreate it and how many times you can't.

    If this is your first pass through the problem, you may want to instrument the hardware to better understand the problem. That's fine; however, that counts as a change, so once you figure out what you need attached or changed to gather more information, jump to step 4.

    Now you want to divide and conquer: based on what you know from analyzing your data, how can you eliminate part of the system? You can only make a single change, so look at the entire problem and find the best SINGLE change you can make to divide the problem in half. Once you understand your data and have chosen what you want to change, go to step 4.

    If, after analyzing the data, you have arrived at the root cause of the problem and you know how to fix the bug, remove all of the changes you made to localize the problem, apply the fix, and go through step 2 one more time. If another iteration shows you have indeed fixed the problem, jump to step 5.
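
The "divide the problem in half" idea in this step can be sketched in code. This is an illustration under strong assumptions, not a general recipe: it assumes exactly one suspect causes the problem, and that a hypothetical test function problem_occurs(subset) reports True whenever the culprit is in the subset it is given. Each call stands for one change-and-retest cycle (steps 4 and 2). The suspect names are invented for the example.

```python
def localize(candidates, problem_occurs):
    """Narrow a list of suspects down to one by halving it on every test.

    Assumes exactly one candidate causes the problem and that
    problem_occurs(subset) is True when the culprit is in subset.
    Returns the remaining suspect and the number of tests that were run.
    """
    suspects = list(candidates)
    if not suspects:
        raise ValueError("need at least one candidate")
    tests_run = 0
    while len(suspects) > 1:
        middle = len(suspects) // 2
        first_half = suspects[:middle]
        tests_run += 1
        if problem_occurs(first_half):
            suspects = first_half          # culprit is in the half we tested
        else:
            suspects = suspects[middle:]   # culprit must be in the other half
    return suspects[0], tests_run


# Hypothetical example: eight suspects, "video" is the real cause.
suspects = ["usb", "wifi", "bluetooth", "video",
            "sound", "ethernet", "webcam", "card-reader"]
culprit, tests = localize(suspects, lambda subset: "video" in subset)
print(culprit, tests)  # prints: video 3
```

With eight suspects this takes three tests instead of up to eight one-by-one tests, which is the reason for choosing the change that splits the problem in half.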

4) Change

This step is iterative and may be done many times.

  • Make a single change, whether that is instrumentation or a change to try to home in on the root cause of the problem. Once you have made your change, go back to step 2 and run through the process again.
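
One way to hold yourself to a single change per pass, and to remember which localization changes still have to be removed, is to keep a small change log. The sketch below is only an illustration; the Change and ChangeLog classes are hypothetical names, and a notebook or the bug report itself would serve the same purpose.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Change:
    description: str
    symptoms_changed: Optional[bool] = None  # None until the change is retested
    reverted: bool = False


@dataclass
class ChangeLog:
    changes: List[Change] = field(default_factory=list)

    def apply(self, description: str) -> Change:
        """Record one new change; refuse if the last one was never retested."""
        if self.changes and self.changes[-1].symptoms_changed is None:
            raise RuntimeError("retest the previous change before making another")
        change = Change(description)
        self.changes.append(change)
        return change

    def record_retest(self, symptoms_changed: bool) -> None:
        """Store the outcome of going back through step 2 for the last change."""
        self.changes[-1].symptoms_changed = symptoms_changed

    def revert(self, change: Change) -> None:
        """Mark a change as removed from the system."""
        change.reverted = True

    def still_applied(self) -> List[str]:
        """List the changes that are still in place and may need removing."""
        return [c.description for c in self.changes if not c.reverted]
```

Before applying the final fix, still_applied() should list nothing but the fix itself.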

5) Fix

  • If you are positive that you have identified the root cause and you have the correct fix for the problem, package it up, install it on a clean system, and jump to step 6.

6) ReTest

  • Time to test one last time and make sure it's really fixed. If it is, proceed to step 7; if not, go back to step 2 and start again. If it worked for you, see if you can get someone else with the same configuration to test it for you. Post information in the bug on the packages, files, and anything else the tester needs to replicate the fix and validate your assumptions.
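
Step 3 asks you to record how often an intermittent bug can be recreated. The same count is useful when retesting, since a single clean run of an intermittent bug proves little. The following is a minimal sketch; attempt is a hypothetical callable that performs one reproduction attempt and returns True when the bug was observed.

```python
def reproduction_rate(attempt, runs):
    """Run attempt() `runs` times and count how often the bug shows up.

    attempt is a hypothetical callable that performs one reproduction
    attempt and returns True when the bug was observed.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    observed = sum(1 for _ in range(runs) if attempt())
    return observed, runs


def summarize(observed, runs):
    """Format the count in a form suitable for pasting into the bug report."""
    return f"bug observed in {observed} of {runs} runs"
```

Comparing the summary from before the fix with the one from after it, using the same number of runs, gives the tester something concrete to validate.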

7) Document Resolution

  • You are not done until you have documented everything you have done and applied a patch or package fix for the bug. Until any technical person could read, understand, and replicate what you have done, you are not finished. Also note any testers other than yourself who validated your fix.

SevenStepTroubleShootingProcess (last edited 2008-12-16 17:52:43 by pool-96-226-232-136)