LaptopTestingDataScraper

I found that the Ubuntu Laptop Testing Team pages had lots of data that I wanted, but not presented in a way that was convenient for someone looking to find a laptop with features X,Y & Z, as I was. The following is a python script to grab data from the laptop testing pages and allow you to query the information as a sqlite database. This script does three things:

  1. Scrape the laptop testing information from the wiki
  2. Dump that information into csv files suitable for import into a database.
  3. Load that information into a SQLite database which can easily be queried to find, for example, laptops with support for a given feature.

Running the script dumps you to an interactive python shell where you can inspect the sqlite DB with some convenience methods. For example,

tom@lacky ubuntu_laptop_info :) $ ./scrape_laptop_info.py
[... will parse pages and print instructions for usage ...]
>>> laptops_for_criteria(Sleep='Yes',Hibernates='Yes',_3DAcceleration='Yes')
Make    Model   Distro
Dell    Latitude 110L   Breezy (current stable)
Dell    Latitude 110L   Dapper (current development)
Dell    Latitude D410   Breezy (current development)
HP      NC4200  Breezy (current stable)
IBM/Lenovo      ThinkPad X41    Breezy Final (current stable)
IBM/Lenovo      ThinkPad X41    Dapper Flight 2 (current development)
Toshiba Tecra A2        Breezy (current development)
Acer    TravelMate 4402WLMi     Hoary (current stable)
Acer    TravelMate 4402WLMi     Breezy (current development)
Compaq  Presario 2701EA Breezy (current stable)
Dell    Latitude D400   Breezy
HP      Pavilion DV1245EA       Breezy
HP      NC6120  Breezy (current stable)
IBM/Lenovo      Thinkpad X31    Breezy (current stable)
IBM/Lenovo      Thinkpad X41    Breezy
Panasonic       CF-W2   Hoary (current stable)
Panasonic       CF-W2   Breezy (current development)
Panasonic       CF-Y2   Hoary (current stable)
Panasonic       CF-Y2   Breezy (current development)
Panasonic       CF-W4   Breezy (current development)
Panasonic       CF-Y4   Breezy (current development)
Sony    Vaio VGN-FS315E Breezy (current stable)
Zitech  Z223    Breezy (current development)
>>> display_info('HP','Pavilion DV1245EA')
distro: Breezy
make: HP
model: Pavilion DV1245EA
Screen: Yes
_3DAcceleration: Yes
ExternalMonitorWorks: Yes
ExternalMonitorMirrors: Yes
ExternalMonitorExtendDesktop: yes(1)
BatteryDetected: Yes
Hibernates: Yes
Sleep: Yes
SoundWorks: Yes
TouchpadDoubletapDoubleClick: Yes
TouchpadScrollDownSide: Yes

distro: Dapper (current development)
make: HP
model: Pavilion DV1245EA
Screen: Yes
_3DAcceleration: Yes
ExternalMonitorWorks: Yes
ExternalMonitorMirrors: Yes
ExternalMonitorExtendDesktop: Untested
BatteryDetected: Yes
Hibernates: Untested
Sleep: Yes
SoundWorks: Yes
TouchpadDoubletapDoubleClick: Yes
TouchpadScrollDownSide: Yes

Hope this is helpful! -- ThomasMHinkle

Please ThomasMHinkle fix the indention errors before copying and paste the script. Example: while not hasattr(next,'name'): next = next.next

Here is the script in its entirety. Some brief documentation will automatically be displayed when you run it.

import urllib, exceptions, csv, re
import BeautifulSoup

# Written by Thomas M. Hinkle 2006 - feel free to copy, modify,
# re-use, and share as you like.
#
# This script requires BeautifulSoup
# http://www.crummy.com/software/BeautifulSoup/download/BeautifulSoup.py
#
# USAGE: Running this script will scrape the laptop testing team
# pages, create 2 csv files with the data, and dump you on a python
# commandline from which you can inspect a sqlite database of the
# laptop info.
#
# If you don't have sqlite, this script will create the 2 csv files
# and then fail. You can of course still make use of those csv files
# as you like.
#
# If you don't have BeautifulSoup, this script will fail utterly.


wikipage = 'https://wiki.ubuntu.com/LaptopTestingTeam'

first_not_let = re.compile('^[^A-Za-z_]')
non_word = re.compile('\W')

def name_to_db_column (n):
    n = non_word.sub('',n)
    if first_not_let.match(n): return '_'+n
    else: return n

class FormatChange (exceptions.Exception):
    pass

def get_string_from_element (el):
    if el.string: return el.string
    else:
        s = ''
        for c in el.contents:
            s += get_string_from_element(c)
        return s

def get_soup (url):
    sock = urllib.urlopen(url)
    return BeautifulSoup.BeautifulSoup(sock.read())

def process_link_table (tbl):
    rows = tbl.fetchChildren('tr')
    rows = rows[1:]
    lst = []
    for r in rows:
        ret = {}
        cells = r.fetchChildren('td')
        ret['make'] = get_string_from_element(cells[0].findChild('p')).strip()
        ret['model'] = get_string_from_element(cells[1].findChild('p')).strip()
        ret['submitter'] = get_string_from_element(cells[2].findChild('a')).strip()
        ret['link'] = cells[3].findChild('a').attrs[0][1]
        lst.append(ret)
    return lst

def grab_list (url=wikipage):
    """Return 2 lists of webpages with laptop information.

    The first list will be the canonical supplied list, presumably
    more reliable.

    The second list will be community supplied.
    """
    
    scraper = get_soup(url)
    try: header = scraper.fetch(name='h3',text='Canonical Supplied')[0]
    except IndexError:
        raise FormatChange("Can't find <h3>Canonical Supplied</h3>")
    next = header.next
    while not hasattr(next,'name'): next = next.next
    if next.name != 'div': raise FormatChange("""We expect a div after <h3>Canonical Supplied</h3>""")
    header = scraper.fetch(name='h3',text='Community Supplied')[0]
    tbl = next.findChild('table')
    canon_supplied = process_link_table(tbl)
    for d in canon_supplied: d['Canonical supplied'] = True
    # Community supplied
    try:
        header = scraper.fetch(name='h3',text='Community Supplied')[0]
    except IndexError:
        raise FormatChange("Can't find <h3>Community Supplied</h3>")
    next = header.next
    while not hasattr(next,'name'): next = next.next
    if next.name != 'div': raise FormatChange("""We expect a div after <h3>Community Supplied</h3>""")
    tbl = next.findChild('table')
    community_supplied = process_link_table(tbl)
    for d in community_supplied: d['Canonical supplied'] = False
    canon_supplied.extend(community_supplied)
    return canon_supplied
    
class InfoGetter:

    def __init__ (self, url=wikipage):
        self.base = url

    def get_laptop_list (self):
        self.data = grab_list(self.base)
        self.laptop_columns = ['make','model','submitter']
        self.record_columns = ['distro']

    def get_info (self):
        for d in self.data:
            d['info']=self.process_laptop_page(d)

    def process_laptop_page (self,d):
        print 'Fetching info for ',d['make'],d['model'],d['submitter']
        try:
            soup = get_soup(urllib.basejoin(self.base,d['link']))
        except:
            print 'Problem fetching',d['link']
            import traceback; traceback.print_exc()
            self.data.remove(d)
            return
        for hwtable in soup.fetch('table'):
            works = hwtable.first('strong',text='Works?')
            works_row = works.findParent('tr')
            if not works_row: continue
            # Find cell that represents the X position of our
            # yes|no values in the table
            for n,c in enumerate(works_row.fetch('td')):
                if c.findChild('strong',text='Works?'):
                    start_at = n
                    break
            distr_row = works_row.findNext('tr')
            distr_cells = distr_row.fetch('td')
            distros = [
                get_string_from_element(c.first('p')).strip().strip('in ').strip('?')
                for c in distr_cells
                ]
            d['distros'] = distros
            rows = distr_row.fetchNextSiblings('tr')
            for r in rows:
                try:
                    values = [get_string_from_element(c.first('p')).strip()
                              for c in r.fetch('td')]
                except AttributeError: # if strip fails because we get None instead of string
                    values = []
                if len(values) < 3: 
                    continue
                # Name of our field is always the first value
                name = values[0]; values = values[1:] #reverse pop!
                name = name.strip().title()
                # Check if we already have this column name
                if name not in self.record_columns:
                    # If not, we better make sure this isn't just a
                    # slight variation...  otherwise, we get duplicate
                    # columns in our SQL table.
                    db_name = name_to_db_column(name).lower()
                    for n in self.record_columns:
                        if name_to_db_column(n).lower()==db_name:
                            print 'Using ',n,'for',name
                            name = n
                            break
                    else: self.record_columns.append(name)
                # Bug# is always our last column
                bugno = values.pop()
                ret = {}
                for n,v in enumerate(values[start_at-1:]):
                    # We're slicing at start_at - 1 because we already
                    # stripped the title off the start.
                    if n >= len(distros):
                        print 'Unusual value',name,v
                        print 'This is value ',n
                        print 'We only have ',len(distros),'distros:',distros
                        break
                    distro = distros[n]
                    if v.lower().strip('!, ;.') in ['yes','y','true','works','work','function']:
                        ret[distro] = 'Yes'
                        #ret['any'] = 'Yes'
                    elif v.lower() in ['no','n','false']:
                        ret[distro] = 'No'
                    else:
                        #print 'Strange value for ',name,distro,':',v
                        ret[distro] = v
                        #if not ret['any']: ret['any'] = v
                d[name] = ret

    def output_csv (self, base_filename='ubuntu_laptop_compatability_db'):
        laptops = file(base_filename + '_laptops.csv','w')
        laptop_writer = csv.writer(laptops)
        laptop_writer.writerow(['LaptopID']+self.laptop_columns)
        records = file(base_filename + '_records.csv','w')
        record_writer = csv.writer(records)
        record_writer.writerow(['Laptop']+self.record_columns)
        laptop_no = 0
        for d in self.data:
            # We give each laptop a unique ID, which is then used
            # to join the laptop table with the records table
            laptop_fields = [d.get(c,'') for c in self.laptop_columns]
            #laptop_fields = [str(laptop_no)] + ['"%s"'%f for f in laptop_fields]
            laptop_writer.writerow([laptop_no] + laptop_fields)
            distros = d.get('distros',[])
            if not distros: print 'No records for ',d
            # Write one record per distro
            for dist in distros:
                record_fields = [laptop_no,dist]
                for r in self.record_columns[1:]: #ignore dist
                    val = d.get(r,None)
                    if not val:
                        record_fields.append('')
                    else:
                        record_fields.append('%s'%val.get(dist,''))
                record_writer.writerow(record_fields)
            laptop_no += 1
        laptops.close()
        records.close()
        print "Wrote data to two CSV files:"
        print "Laptop info: ",base_filename + '_laptops.csv'
        print "Record info: ",base_filename + '_records.csv'

class LaptopDatabase:

    """A simple interface to a SQLite database of Ubuntu Laptop information.

    Our SQL column names are going to be the ubuntu fields stripped of
    spaces, numbers and punctuation.

    For example,

    Sleep? becomes Sleep
    External
    3D Acceleration? becomes _3DAcceleration
    """

    DISPLAY_IN_ROWS_LENGTH = 6
    
    def __init__ (self,
                  db_file='ubuntu_laptop_db.db'
                  ):
        import sqlite
        self.connection = sqlite.connect(db_file)
        self.cursor = self.connection.cursor()

    def list_laptop_fields (self):
        self.cursor.execute('SELECT * FROM sqlite_master WHERE name="laptops"')
        return ld.cursor.fetchone()[4]

    def list_record_fields (self):
        self.cursor.execute('SELECT * FROM sqlite_master WHERE name="records"')
        return ld.cursor.fetchone()[4]

    def load_tables (self, base_filename='ubuntu_laptop_compatability_db'):
        laptop_file = base_filename + '_laptops.csv'
        records_file = base_filename + '_records.csv'
        self.load_from_csv_file(laptop_file,'laptops')
        self.load_from_csv_file(records_file,'records')
        self.connection.commit()

    def load_from_csv_file (self,filename,table_name):
        infile = file(filename,'r')
        reader = csv.reader(infile)
        headers = reader.next()
        # Column names must be plain old letters
        headers = [name_to_db_column(h) for h in headers]
        sql = "CREATE TABLE %s ("%table_name
        idrow = headers[0]
        rows = ['"%(idrow)s" int'%locals()]
        rows += ['"%s" str'%h for h in headers[1:]]
        sql += ", ".join(rows)
        sql += ")"
        try:
            self.cursor.execute(sql)
        except:
            print 'Error executing following SQL'
            print sql
            print 'Sorry...'
            raise
        for row in reader:
            sql = "INSERT INTO %(table_name)s"%locals()
            sql += " (" + ", ".join(['"%s"'%h for h in headers]) + ') '
            sql += "VALUES "
            sql += "("
            vals = [int(row[0])] + row[1:]
            sql += ", ".join(len(vals)*["%s"])
            sql += ")"
            self.cursor.execute(sql,vals)
        
    def execute (self, sql):
        self.cursor.execute(sql)
        return self.cursor.fetchall()

    def display_results (self, rows, headers=None):
        """Display result rows, possibly with headers.

        If we have headers, there must be a header for each column.
        
        If we have DISPLAY_IN_ROW_LENGTH columns or less, we will display
        column TAB column TAB column

        Otherwise, we'll display

        Header: Column
        Header: Column
        ...
        """
        if len(rows)>0 and len(rows[0])< self.DISPLAY_IN_ROWS_LENGTH:
            if headers:
                print "\t".join(headers)
            for r in rows:
                print "\t".join(r)
        else:
            for r in rows:
                for n,c in enumerate(r):
                    if headers: print "%s:"%headers[n],
                    print c
                print
            
    def laptops_for_criteria (self, **criteria):
        """Display list of laptops that match ALL criteria.

        Criteria are keyword arguments of the form
        column="value"

        If criteria contains a %, we use LIKE instead of =

        For example,

        get_laptops_for_criteria(Sleep="Yes%") will execute the SQL
        SELECT... WHERE Sleep LIKE "Yes%"

        get_laptops_for_criteria(Sleep="Yes") will execute the SQL
        SELECT... WHERE Sleep="Yes%"
        """

        self.display_results(self.get_laptops_for_criteria(**criteria),
                             headers=['Make','Model','Distro'])


    def __make_where_statement__ (self, **criteria):
        where_statements = []
        for k,v in criteria.items():
            k = name_to_db_column(k)
            if v.find('%')>0:
                where_statements.append('"%s" LIKE "%s"'%(k,v))
            else:
                where_statements.append('"%s"="%s"'%(k,v))
        return " AND ".join(where_statements)

    def get_laptops_for_criteria (self, **criteria):
        """Get laptops that match ALL criteria.

        Criteria are keyword arguments of the form
        column="value"

        If criteria contains a %, we use LIKE instead of =

        For example,

        get_laptops_for_criteria(Sleep="Yes%") will execute the SQL
        SELECT... WHERE Sleep LIKE "Yes%"

        get_laptops_for_criteria(Sleep="Yes") will execute the SQL
        SELECT... WHERE Sleep="Yes%"
        """
        base = """SELECT laptops.make,laptops.model,records.distro
            FROM laptops JOIN records on laptops.LaptopID=records.Laptop
            """
        where_statement = self.__make_where_statement__(**criteria)
        if where_statement:
            base += "WHERE " + where_statement
        return self.execute(base)

    def get_hibernate_and_sleep (self):
        return self.get_laptops_for_criteria(**{'Hibernates':'Yes',
                                              'Sleep':'Yes'})

    def get_almost_hibernate_and_sleep (self):
        return self.get_laptops_for_criteria(**{'Sleep':'Yes%',
                                              'Hibernates':'Yes%'}
                                             )

    def get_ideal_laptops (self):
        return self.get_laptops_for_criteria(
            **{
            'Sleep':"Yes%",
            'Hibernates':"Yes%",
            '_3DAcceleration':"Yes%",
            'ExternalMonitorExtendDesktop':"Yes%",
            }
            )

    def list_models (self, make=None):
        self.display_results(
            (make
             and 
             self.execute('SELECT make,model from laptops where make="%s"'%make)
             or
             self.execute('SELECT make,model from laptops')
             ),
            ['Make','Model']
            )

    def display_info (self, make=None, model=None, **criteria):
        fields = ["distro",
                  "make",
                  "model",
                  "Screen",
                  "_3DAcceleration",
                  "ExternalMonitorWorks",
                  "ExternalMonitorMirrors",
                  "ExternalMonitorExtendDesktop",
                  #"ExternalMonitorDvi",
                  "BatteryDetected",
                  "Hibernates",
                  "Sleep",
                  #"SuspendToRamSleep",
                  #"SuspendToDisk",
                  #"DimBlankMonitor",
                  "SoundWorks",
                  "TouchpadDoubletapDoubleClick",
                  "TouchpadScrollDownSide",
                  ]
        sql = "SELECT " + ", ".join(['"%s"'%f for f in fields])
        sql += " FROM records JOIN laptops ON laptops.LaptopID=records.Laptop"
        if make: criteria['Make']=make
        if model: criteria['Model']=model
        if criteria:
            sql += ' WHERE ' + self.__make_where_statement__(**criteria)
        #print 'Execute and display',sql
        self.display_results(self.execute(sql),
                             headers=fields)


def show_help():
    print """Laptop database ready for you inspection"""
    print """For example... """
    print """list_models() will list all models. list_models("Dell") will list all dell models."""
    print 'display_info("Toshiba") will show a subset of key info for all Toshiba laptops'
    print 'display_info("Toshiba","Satellite L10-101") will show info for that particular laptop.'
    print "display_info('Toshiba',Sleep='Yes') will show info for all Toshiba models that support"
    print "                                    sleep."
    print "display_info('Dell','Lat%') will show info for all Dell Latitude models."
    print "laptops_for_criteria(Sleep='Yes') returns a list of laptops that Sleep"
    print "laptops_for_criteria(Make='Dell',Sleep='Yes') returns a list of dell laptops"
    print "                                              that sleep."
    print 'list_record_fields() will give you a sense of what the DB columns are.'
    print
    print 'type show_advanced_help() for more advanced help.'
    print
    print 'Type show_help() to repeat this message at any point'
    print 'Type Control-D to exit (or Control-Z on Windows)'

def show_advanced_help():
       print '''ld contains your LaptopDatabase instance. 
dir(ld) shows you all ld methods you might want to play with. 

help(laptops_for_criteria) will explain the
syntax for laptops_for_criteria in more detail.


SQL Details:
ld.execute(SQL) will execute a SQL statement and print the result rows.
ld.cursor is the cursor directly if you want to manually handle 
          ld.cursor.execute() and ld.cursor.fetchone/ld.cursor.fetchall

list_record_fields() and list_laptop_fields() will give a list of the database 
columns in each of the two tables. Usually, you'll want to use:

SELECT columns FROM laptops JOIN records ON laptops.LaptopID=records.Laptop

Type show_advanced_help() to repeat this message.
Type show_help() for general help.
Type Control-D to exit.
'''


if __name__ == '__main__':
    import os, os.path
    if not os.path.exists('ubuntu_laptop_db.db'):
        if (not os.path.exists('ubuntu_laptop_compatability_db_laptops.csv')
            or
            not os.path.exists('ubuntu_laptop_compatability_db_records.csv')
            ):
            print "Grabbing information from Ubuntu wiki"
            ig = InfoGetter()
            print "Get list of laptop pages"
            ig.get_laptop_list()
            print "Scrape info from laptop pages"
            ig.get_info()
            print "Write csv files"
            ig.output_csv()
        else:
            print "Using previously existing csv files"
        try:
            ld = LaptopDatabase()
        except:
            print 'You have laptop info in dictionaries in ig'
            print 'You have two csv files with all your data in the current directory'
            print 'Unforunately, we had trouble loading a SQL database'
            raise
        print 'Loading laptop database from csv files'
        print '(Delete those files if you want to regenerate them)'
        ld.load_tables()
    else:
        print 'Loading previously existing database from ubuntu_laptop_db.db'
        print '(Move or delete this DB if you want to regenerate it)'
        ld = LaptopDatabase()
    ld = LaptopDatabase()
    execute = ld.execute
    laptops_for_criteria = ld.laptops_for_criteria
    cursor = ld.cursor
    connection = ld.connection
    list_record_fields = ld.list_record_fields
    list_laptop_fields = ld.list_laptop_fields
    display_info = ld.display_info
    list_models = ld.list_models
    print
    show_help()


CategoryLaptop

LaptopTestingDataScraper (last edited 2010-02-07 00:21:39 by newacct)