At the time of writing (April 16th, 2013), there are nearly 45 million reports in the database (44,799,616).
Iterating the OOPS column family alone (where the reports live), with no other lookups and with all the data passing through a single node, takes approximately 10 hours.
45 million reports / ~1,285 reports per second = 35,019 seconds (9.7 hours).
On production (finfolk) it takes approximately 9.4 hours (1,330 reports per second).
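The arithmetic above is easy to sanity-check. A minimal sketch (the report count and per-second rates are taken from the figures above):

```python
def hours_at(rate_per_sec, total=44_799_616):
    """Hours needed to iterate `total` rows at `rate_per_sec` rows/second."""
    return total / rate_per_sec / 3600

# ~1,285 rows/s on the test node  -> roughly 9.7 hours
# ~1,330 rows/s on finfolk        -> roughly 9.4 hours
test_node_hours = hours_at(1285)
production_hours = hours_at(1330)
```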
pycassa.ColumnFamily(pool, 'OOPS').get_range(columns=['SystemIdentifier', 'DistroRelease'], buffer_size=10 * 1024, include_timestamp=True)
Going up to buffer_size=20 * 1024 slows it down to 1,015 rows per second.
processed 61137 (933/s)
processed 82398 (1373/s)
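Progress lines like the two above can come from a small wrapper around the iteration. A minimal sketch (the function name, logging interval, and injectable clock are assumptions, not the actual script):

```python
import time

def iterate_with_progress(rows, every=10_000, clock=time.monotonic, log=print):
    """Yield rows unchanged, logging 'processed N (rate/s)' every
    `every` rows, in the style of the progress lines above."""
    start = clock()
    for n, row in enumerate(rows, start=1):
        if n % every == 0:
            rate = n / (clock() - start)
            log(f"processed {n} ({rate:.0f}/s)")
        yield row
```

Injecting the clock and log sink keeps the wrapper testable; in the real scripts `rows` would be the get_range() generator.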
Things to try:
- Use a batch mutator for small inserts (FirstError, etc.).
- Cache the entire contents of ColumnFamilies read from inside the get_range() loop.
- Multithreaded / multiprocess on get_range(start='', finish='b'), get_range(start='b', finish='c'), ...
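The batch-mutator idea amounts to queueing small inserts and flushing them in groups rather than one round trip each. A toy stand-in for pycassa's batch interface, to illustrate the buffering (the class and its interface are assumptions, not pycassa's actual implementation):

```python
class BatchMutator:
    """Queue small inserts and flush them in groups of `queue_size`
    instead of issuing one RPC per insert."""

    def __init__(self, send, queue_size=100):
        self.send = send          # callable that ships a list of mutations
        self.queue_size = queue_size
        self.queue = []

    def insert(self, key, columns):
        self.queue.append((key, columns))
        if len(self.queue) >= self.queue_size:
            self.flush()

    def flush(self):
        if self.queue:
            self.send(self.queue)
            self.queue = []
```

In pycassa itself this would be `ColumnFamily.batch(queue_size=...)` used as a context manager, so the final partial batch is flushed on exit.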
- tools/build_src_version_buckets.py processed 1549190 buckets at a rate of 1359/min when run from a retracer.
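The multiprocess idea above needs the key space split into contiguous (start, finish) ranges, one per worker. A minimal sketch of the splitting (the boundary choice is an assumption, and key-ordered range scans like this presume an order-preserving partitioner; under RandomPartitioner the splits would have to be token-based instead):

```python
import string

def key_ranges(boundaries=string.ascii_lowercase):
    """Split the key space into contiguous (start, finish) pairs suitable
    for parallel get_range(start=..., finish=...) workers.  An empty
    string means 'start/end of the ring' in pycassa's get_range."""
    edges = [""] + list(boundaries) + [""]
    return list(zip(edges[:-1], edges[1:]))
```

Each pair would then be handed to one process, e.g. via multiprocessing.Pool, with every worker opening its own connection pool.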
Thursday, June 13th, 2013
After running build_errors_by_release.py for half a day to a day, we realised it was going to take an awfully long time to complete:
[10:41:45] <gnuoy> ev, the finfolk job claims to have processed 800000 somthings, how many somethings are there to be processed ?
[10:42:49] <ev> gnuoy: 51 million
This is again because we're using an inner loop against a second column family. Reducing the script to just the OOPS CF, with the intention of loading FirstError into memory instead, brings the time down to about 13-14 hours (1,080/s). Running the same script under PyPy gives a modest improvement, to 1,197 reports per second.
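Loading FirstError into memory turns the per-row network lookup into a dict lookup. A minimal sketch of the idea (the preload helper, the join key, and the column names are illustrative assumptions, not the actual build_errors_by_release.py code):

```python
def preload(cf_rows):
    """Materialise a small column family (e.g. FirstError) as a dict,
    so the get_range() loop does an in-memory lookup instead of a
    per-row round trip.  `cf_rows` is any (key, columns) iterable,
    such as the output of FirstError.get_range()."""
    return dict(cf_rows)

def join_reports(oops_rows, first_error):
    """Hypothetical inner loop: annotate each OOPS row with its
    preloaded FirstError columns (None when there is no match)."""
    for key, columns in oops_rows:
        yield key, columns, first_error.get(columns.get("SystemIdentifier"))
```

The trade-off is memory: this only works while the inner column family fits comfortably in RAM on the node running the script.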