CassandraQueries
* http://pycassa.github.io/pycassa/api/pycassa/batch.html
* This will not work:
  * https://github.com/pycassa/pycassa/issues/170
  * http://stackoverflow.com/questions/14730137/cassandra-randompartitioner-and-full-table-scans
At the time of writing (April 16th, 2013), there are nearly 45 million reports in the database (44,799,616).
Iterating the OOPS column family alone (where the reports live), with no other lookups and with the data passing through a single node, takes approximately 10 hours.
45 million reports / ~1,285 reports per second = 35,019 seconds (9.7 hours).
On production (finfolk) it takes approximately 9.4 hours (1,330 reports per second).
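The estimates above are a plain division of report count by throughput. A minimal sketch of that arithmetic, using the figures quoted on this page (the function name is hypothetical):

```python
def scan_duration(reports, reports_per_second):
    """Return (seconds, hours) needed to iterate `reports` rows at a steady rate."""
    seconds = reports / reports_per_second
    return seconds, seconds / 3600.0


# Figures from this page: ~45 million reports at ~1,285 reports per second.
seconds, hours = scan_duration(45_000_000, 1285)
print("%d seconds (%.1f hours)" % (seconds, hours))

# Production (finfolk) rate quoted on this page: 1,330 reports per second.
prod_seconds, prod_hours = scan_duration(45_000_000, 1330)
print("%d seconds (%.1f hours)" % (prod_seconds, prod_hours))
```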
Test configuration:
pycassa.ColumnFamily(pool, 'OOPS').get_range(columns=['SystemIdentifier', 'DistroRelease'], buffer_size=10 * 1024, include_timestamp=True)
From BlueFin | From finfolk
processed 61137 (933/s) | processed 82398 (1373/s)
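The script that produced the "processed N (R/s)" lines is not shown here. A minimal sketch of how such a counter could be wrapped around any row iterator follows, assuming progress is reported on a fixed time interval. The function name, the `interval`, `clock` and `emit` parameters are all hypothetical; with pycassa, `rows` would be the generator returned by the get_range call in the test configuration.

```python
import time


def iterate_with_rate(rows, interval=60.0, clock=time.time, emit=print):
    """Yield rows unchanged, emitting 'processed N (R/s)' roughly every `interval` seconds.

    `rows` can be any iterable. `clock` and `emit` are injectable so the
    function can be exercised without a database or real waiting.
    """
    start = last = clock()
    count = 0
    for row in rows:
        count += 1
        now = clock()
        if now - last >= interval:
            # Average rate since the start of the scan, not since the last report.
            emit("processed %d (%d/s)" % (count, count / (now - start)))
            last = now
        yield row
```

Because the wrapper is a generator, it adds no buffering of its own; the consumer still pulls one row at a time.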
Things to try:
* Use a batch mutator for small inserts (FirstError, etc.).
* Run get_range over separate key ranges in multiple threads or processes: get_range(start='', finish='b'), get_range(start='b', finish='c'), ...
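One caveat for the key-range idea: if the cluster uses RandomPartitioner, as the Stack Overflow question linked on this page suggests, rows are ordered by a token derived from the MD5 hash of the key rather than by the key itself. Lexical boundaries such as start='', finish='b' then do not carve the data into disjoint slices. The sketch below illustrates this. It assumes the token is the absolute value of the MD5 digest read as a signed big-endian integer, which approximates RandomPartitioner and is not pycassa code. Both helper names are hypothetical.

```python
import hashlib


def random_partitioner_token(key):
    """Approximate a RandomPartitioner token for `key` (bytes).

    Assumption: token = abs(MD5 digest interpreted as a signed big-endian integer).
    """
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def split_token_space(parts):
    """Split the token space [0, 2**127] into `parts` contiguous (start, end) ranges."""
    top = 2 ** 127
    bounds = [top * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# Keys in lexical order, as a start/finish split by key would assume.
keys = [b"a", b"b", b"c", b"d", b"e", b"f", b"g", b"h"]

# The same keys in the order a token-ordered scan would visit them.
by_token = sorted(keys, key=random_partitioner_token)
print(by_token)

# Equal-width token ranges, one per hypothetical worker.
ranges = split_token_space(4)
```

If parallel scans are pursued, dividing the token space per worker as in `split_token_space` is one possible direction. Whether the client library exposes token-based range queries would need to be checked against its documentation.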
ErrorTracker/CassandraQueries (last edited 2013-06-13 10:55:13 by ev)