At the time of writing (April 16th, 2013), there are nearly 45 million reports in the database (44,799,616).
Iterating the OOPS column family alone (where the reports live), with no other lookups and with all the data passing through a single node, takes approximately 10 hours.
45 million reports / ~1,285 reports per second = 35,019 seconds (9.7 hours).
On production (finfolk) it takes approximately 9.4 hours (1,330 reports per second).
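The arithmetic above is easy to sanity-check. A minimal sketch (the report count and per-second rates are taken from the figures above):

```python
def hours_at(rate_per_sec, total=44_799_616):
    """Hours needed to iterate `total` rows at `rate_per_sec` rows/second."""
    return total / rate_per_sec / 3600

# ~1,285 rows/s on the test node  -> roughly 9.7 hours
# ~1,330 rows/s on finfolk        -> roughly 9.4 hours
test_node_hours = hours_at(1285)
production_hours = hours_at(1330)
```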
pycassa.ColumnFamily(pool, 'OOPS').get_range(columns=['SystemIdentifier', 'DistroRelease'], buffer_size=10 * 1024, include_timestamp=True)
Going up to buffer_size=20 * 1024 slows it down to 1,015 rows per second.
processed 61137 (933/s)
processed 82398 (1373/s)
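Progress lines like the two above can come from a small wrapper around the iteration. A minimal sketch (the function name, logging interval, and injectable clock are assumptions, not the actual script):

```python
import time

def iterate_with_progress(rows, every=10_000, clock=time.monotonic, log=print):
    """Yield rows unchanged, logging 'processed N (rate/s)' every
    `every` rows, in the style of the progress lines above."""
    start = clock()
    for n, row in enumerate(rows, start=1):
        if n % every == 0:
            rate = n / (clock() - start)
            log(f"processed {n} ({rate:.0f}/s)")
        yield row
```

Injecting the clock and log sink keeps the wrapper testable; in the real scripts `rows` would be the get_range() generator.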
Things to try:
- Use a batch mutator for small inserts (FirstError, etc.).
- Cache the entire contents of ColumnFamilies read from inside the get_range() loop.
- Multithreaded / multiprocess on get_range(start='', finish='b'), get_range(start='b', finish='c'), ...
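The batch-mutator idea amounts to queueing small inserts and flushing them in groups rather than one round trip each. A toy stand-in for pycassa's batch interface, to illustrate the buffering (the class and its interface are assumptions, not pycassa's actual implementation):

```python
class BatchMutator:
    """Queue small inserts and flush them in groups of `queue_size`
    instead of issuing one RPC per insert."""

    def __init__(self, send, queue_size=100):
        self.send = send          # callable that ships a list of mutations
        self.queue_size = queue_size
        self.queue = []

    def insert(self, key, columns):
        self.queue.append((key, columns))
        if len(self.queue) >= self.queue_size:
            self.flush()

    def flush(self):
        if self.queue:
            self.send(self.queue)
            self.queue = []
```

In pycassa itself this would be `ColumnFamily.batch(queue_size=...)` used as a context manager, so the final partial batch is flushed on exit.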
- tools/build_src_version_buckets.py processed 1549190 buckets at a rate of 1359/min when run from a retracer.
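The multiprocess idea above needs the key space split into contiguous (start, finish) ranges, one per worker. A minimal sketch of the splitting (the boundary choice is an assumption, and key-ordered range scans like this presume an order-preserving partitioner; under RandomPartitioner the splits would have to be token-based instead):

```python
import string

def key_ranges(boundaries=string.ascii_lowercase):
    """Split the key space into contiguous (start, finish) pairs suitable
    for parallel get_range(start=..., finish=...) workers.  An empty
    string means 'start/end of the ring' in pycassa's get_range."""
    edges = [""] + list(boundaries) + [""]
    return list(zip(edges[:-1], edges[1:]))
```

Each pair would then be handed to one process, e.g. via multiprocessing.Pool, with every worker opening its own connection pool.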
Thursday, June 13th, 2013
After running build_errors_by_release.py for half a day to a day, we realised it was going to take an awfully long time to complete:
[10:41:45] <gnuoy> ev, the finfolk job claims to have processed 800000 somthings, how many somethings are there to be processed ?
[10:42:49] <ev> gnuoy: 51 million
This is again because we're using an inner loop against a second column family. Reducing the script to just the OOPS CF, with the intention of loading FirstError into memory instead, brings the time down to about 13-14 hours (1,080/s). Running the same script under PyPy gives a modest improvement, to 1,197 reports per second.
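Loading FirstError into memory turns the per-row network lookup into a dict lookup. A minimal sketch of the idea (the preload helper, the join key, and the column names are illustrative assumptions, not the actual build_errors_by_release.py code):

```python
def preload(cf_rows):
    """Materialise a small column family (e.g. FirstError) as a dict,
    so the get_range() loop does an in-memory lookup instead of a
    per-row round trip.  `cf_rows` is any (key, columns) iterable,
    such as the output of FirstError.get_range()."""
    return dict(cf_rows)

def join_reports(oops_rows, first_error):
    """Hypothetical inner loop: annotate each OOPS row with its
    preloaded FirstError columns (None when there is no match)."""
    for key, columns in oops_rows:
        yield key, columns, first_error.get(columns.get("SystemIdentifier"))
```

The trade-off is memory: this only works while the inner column family fits comfortably in RAM on the node running the script.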