CassandraQueries

Differences between revisions 2 and 3
Revision 2 as of 2013-04-16 10:22:04
Size: 1526
Editor: ev
Comment:
Revision 3 as of 2013-04-16 10:31:28
Size: 1721
Editor: ev
Comment:

At the time of writing (April 16th, 2013), there are nearly 45 million reports in the database (44,799,616).

Iterating over the OOPS column family alone (where the reports live), with no other lookups and with all data passing through a single node, takes approximately 10 hours.

45 million reports / ~1,285 reports per second ≈ 35,019 seconds (9.7 hours).

On production (finfolk) it takes approximately 9.4 hours (1,330 reports per second).
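
The estimates above can be re-derived with a few lines of Python. This is only a sketch of the arithmetic: the helper name `scan_time` is made up for illustration, and it assumes the rate stays constant for the whole scan.

```python
def scan_time(total_rows, rows_per_second):
    """Return (seconds, hours) needed to scan total_rows at a steady rate."""
    seconds = total_rows / float(rows_per_second)
    return seconds, seconds / 3600.0


# Figures quoted on this page.
ROUNDED_TOTAL = 45000000   # "nearly 45 million reports"
EXACT_TOTAL = 44799616     # exact count as of April 16th, 2013

for label, total, rate in [
    ("test run, rounded total", ROUNDED_TOTAL, 1285),
    ("test run, exact total", EXACT_TOTAL, 1285),
    ("finfolk, rounded total", ROUNDED_TOTAL, 1330),
]:
    seconds, hours = scan_time(total, rate)
    print("%s: %d seconds (%.1f hours)" % (label, seconds, hours))
```

Using the exact report count instead of the rounded 45 million changes the result by only a few minutes, so the rounded figure is good enough for planning.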

Test configuration:

  • pycassa.ColumnFamily(pool, 'OOPS').get_range(columns=['SystemIdentifier', 'DistroRelease'], buffer_size=10 * 1024, include_timestamp=True)
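
The script that produced the timings is not shown on this page, so the following is a rough sketch of how such a scan might be driven, not the original code. The names `iterate_with_progress` and `open_oops_rows` are invented for illustration, as are the keyspace and server arguments; the time-based reporting interval is an assumption as well. Only the `get_range` arguments come from the test configuration above.

```python
import sys
import time


def iterate_with_progress(rows, interval=60.0, clock=time.time, out=sys.stdout):
    """Consume rows, writing a cumulative 'processed N (R/s)' line roughly
    every `interval` seconds. Returns the total number of rows seen."""
    start = clock()
    next_report = start + interval
    count = 0
    for _key, _columns in rows:
        count += 1
        now = clock()
        if now >= next_report:
            elapsed = now - start
            if elapsed > 0:
                # Rate is the average since the start, not since the last line.
                out.write("processed %d (%d/s)\n" % (count, count / elapsed))
            next_report = now + interval
    return count


def open_oops_rows(keyspace, servers):
    """Build the get_range iterator from the test configuration.

    keyspace and servers are placeholders supplied by the caller.
    pycassa is imported here so the reporter above can be tried without it.
    """
    import pycassa
    pool = pycassa.ConnectionPool(keyspace, server_list=servers)
    oops = pycassa.ColumnFamily(pool, 'OOPS')
    return oops.get_range(
        columns=['SystemIdentifier', 'DistroRelease'],
        buffer_size=10 * 1024,
        include_timestamp=True,
    )
```

A run would then look like `iterate_with_progress(open_oops_rows(keyspace, servers))`, with the real keyspace name and server list substituted. Because `include_timestamp=True` is set, each column value should arrive as a `(value, timestamp)` pair rather than a bare value.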

From BlueFin:

processed 61137 (933/s)
processed 132453 (1004/s)
processed 213929 (1101/s)
processed 295449 (1157/s)
processed 376930 (1186/s)
processed 468641 (1216/s)
processed 550162 (1229/s)
processed 631666 (1239/s)
processed 713185 (1246/s)
processed 794678 (1254/s)
processed 876226 (1261/s)
processed 957740 (1266/s)
processed 1039262 (1268/s)
processed 1120755 (1271/s)
processed 1202274 (1270/s)
processed 1293981 (1276/s)
processed 1375494 (1278/s)
processed 1456979 (1280/s)
processed 1538473 (1283/s)
processed 1620010 (1285/s)
processed 1701522 (1286/s)
processed 1783011 (1284/s)
processed 1864538 (1283/s)
processed 1946040 (1284/s)
processed 2037762 (1287/s)
processed 2119265 (1287/s)
processed 2190588 (1283/s)
processed 2272100 (1284/s)
processed 2353586 (1282/s)
processed 2435095 (1282/s)
processed 2516581 (1280/s)

From finfolk:

Things to try:

  • Use a batch mutator for small inserts (FirstError, etc).

  • Run multithreaded / multiprocess scans over disjoint key ranges: get_range(start='', finish='b'), get_range(start='b', finish='c'), ...
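
The key-range idea could be sketched as follows. This is untested against the real cluster and rests on several assumptions: the helper names (`key_ranges`, `count_range`, `parallel_count`, `pycassa_fetch`) are made up for illustration, the pycassa objects are assumed to be safe to share across threads, and splitting by key only divides the data sensibly if the cluster's partitioner preserves key order. Key ranges are also believed to be inclusive at both ends, so a row whose key exactly equals a boundary would be counted twice.

```python
from concurrent.futures import ThreadPoolExecutor


def key_ranges(boundaries):
    """Turn ['b', 'c'] into [('', 'b'), ('b', 'c'), ('c', '')].

    An empty string means "unbounded" on that side, as in get_range().
    """
    edges = [''] + list(boundaries) + ['']
    return list(zip(edges[:-1], edges[1:]))


def count_range(fetch, start, finish):
    """Count the rows that fetch(start, finish) yields."""
    return sum(1 for _row in fetch(start, finish))


def parallel_count(fetch, boundaries, workers=4):
    """Scan each key range in its own thread and sum the per-range counts."""
    ranges = key_ranges(boundaries)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(count_range, fetch, s, f) for s, f in ranges]
        return sum(future.result() for future in futures)


def pycassa_fetch(column_family):
    """Adapt a pycassa ColumnFamily to the fetch(start, finish) shape."""
    def fetch(start, finish):
        return column_family.get_range(
            start=start, finish=finish,
            columns=['SystemIdentifier', 'DistroRelease'],
            buffer_size=10 * 1024)
    return fetch
```

Keeping `fetch` as a plain callable means the splitting logic can be exercised with a fake data source first; `parallel_count(pycassa_fetch(oops), ['b', 'c'])` would then run the same logic against a real `ColumnFamily` named `oops`.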

ErrorTracker/CassandraQueries (last edited 2013-06-13 10:55:13 by ev)