At the time of writing (April 16th, 2013), there are nearly 45 million reports in the database (44,799,616).
Iterating over the OOPS column family alone (where the reports live), with no other lookups and with the data passing through a single node, takes approximately 10 hours.
45 million reports / ~1,285 reports per second ≈ 35,019 seconds (9.7 hours).
On production (finfolk) it takes approximately 9.4 hours (1,330 reports per second).
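The estimate above is plain division, so it is easy to redo as the database grows or the measured rate changes. A minimal sketch; the function name scan_duration is made up for illustration, and the inputs are the figures quoted in these notes:

```python
def scan_duration(total_rows, rows_per_second):
    """Return (seconds, hours) needed to iterate total_rows at a steady rate."""
    seconds = total_rows / float(rows_per_second)
    return seconds, seconds / 3600.0


# Rates quoted in the notes: ~1,285/s through a single node, 1,330/s on production.
for label, rate in [("single node", 1285), ("production (finfolk)", 1330)]:
    seconds, hours = scan_duration(45000000, rate)
    print("%s: %d seconds (%.1f hours)" % (label, seconds, hours))
```

This assumes the rate stays constant for the whole run, which the sample rates further down suggest is only roughly true.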
pycassa.ColumnFamily(pool, 'OOPS').get_range(columns=['SystemIdentifier', 'DistroRelease'], buffer_size=10 * 1024, include_timestamp=True)
Raising buffer_size to 20 * 1024 slows the iteration down to 1,015 rows per second.
processed 61137 (933/s)
processed 82398 (1373/s)
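Progress lines in this format can be produced by wrapping the row iterator in a small generator. This is a sketch, not the script that produced the figures above: the name iterate_with_progress, the reporting interval, and the choice to report the cumulative rate since the start are all assumptions.

```python
import time


def iterate_with_progress(rows, report_every=10000, clock=time.time, log=print):
    """Yield each row unchanged and periodically log the cumulative rate."""
    start = clock()
    processed = 0
    for row in rows:
        yield row
        processed += 1
        if processed % report_every == 0:
            elapsed = clock() - start
            # Guard against a zero interval on very fast iterators.
            rate = processed / elapsed if elapsed > 0 else 0.0
            log("processed %d (%d/s)" % (processed, rate))


if __name__ == "__main__":
    # Stand-in iterable; in the real job this would be the get_range() result.
    for _ in iterate_with_progress(range(25000)):
        pass
```

The clock and log parameters exist only so the wrapper can be exercised without a live cluster.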
Things to try:
Use a batch mutator for small inserts (FirstError, etc.).
Cache the entire contents of the ColumnFamilies that are read from inside the get_range() loop.
Split the iteration across multiple threads or processes, one key range each: get_range(start='', finish='b'), get_range(start='b', finish='c'), ...
- tools/build_src_version_buckets.py processed 1549190 buckets at a rate of 1359/min when run from a retracer.
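For the key-range idea under "Things to try", the (start, finish) pairs can be generated rather than written out by hand. A minimal sketch; key_ranges is a hypothetical helper, and the single-letter boundaries only extend the 'b', 'c' pattern from the notes. Whether such ranges divide the rows evenly, or at all, depends on how the OOPS keys are distributed and on the cluster's partitioner, neither of which is stated here.

```python
import string


def key_ranges(boundaries):
    """Return (start, finish) pairs covering the whole key space.

    An empty string means "unbounded", as in get_range(start='', ...).
    """
    edges = [''] + sorted(boundaries) + ['']
    return list(zip(edges[:-1], edges[1:]))


# Boundaries 'b' .. 'z' give 26 ranges: ('', 'b'), ('b', 'c'), ..., ('z', '').
ranges = key_ranges(string.ascii_lowercase[1:])
for start, finish in ranges[:3]:
    print("get_range(start=%r, finish=%r)" % (start, finish))
```

Each pair could then be handed to its own worker thread or process. Adjacent ranges share a boundary key, so a worker may need to skip a row already seen by its neighbour if the bounds are inclusive.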