CassandraQueries
At the time of writing (April 16th, 2013), there are nearly 45 million reports in the database (44,799,616).
Iterating over the OOPS column family alone (where the reports live), with no other lookups and with all data passing through a single node, takes approximately 10 hours.
45 million reports / ~1,285 reports per second = 35,019 seconds (9.7 hours).
On production (finfolk) it takes approximately 9.4 hours (1,330 reports per second).
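The estimates above are row count divided by throughput. A minimal sketch of that arithmetic follows; the function name scan_time is made up for illustration and is not part of any tool mentioned on this page.

```python
def scan_time(total_rows, rows_per_second):
    """Return (seconds, hours) needed to iterate total_rows at a given rate."""
    seconds = float(total_rows) / rows_per_second
    return seconds, seconds / 3600.0


# The two rates quoted above, applied to the rounded figure of 45 million reports.
for label, rate in [("single node", 1285), ("finfolk", 1330)]:
    seconds, hours = scan_time(45000000, rate)
    print("%s: %.0f seconds (%.1f hours)" % (label, seconds, hours))
# single node: 35019 seconds (9.7 hours)
# finfolk: 33835 seconds (9.4 hours)
```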
Test configuration:
pycassa.ColumnFamily(pool, 'OOPS').get_range(columns=['SystemIdentifier', 'DistroRelease'], buffer_size=10 * 1024, include_timestamp=True)
Raising buffer_size to 20 * 1024 slows iteration down to 1,015 rows per second.
From BlueFin | From finfolk
processed 61137 (933/s) | processed 82398 (1373/s)
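Progress lines in the "processed N (R/s)" form can be produced by counting rows and dividing by elapsed time. The sketch below is one possible way to do that, not the script actually used for these measurements; timed_scan and format_progress are hypothetical names, and rows is assumed to be any iterable of (key, columns) pairs, which is what pycassa's get_range() yields.

```python
import time


def format_progress(count, elapsed_seconds):
    """Format a row count and elapsed time as 'processed N (R/s)'."""
    rate = count / float(elapsed_seconds) if elapsed_seconds > 0 else 0.0
    return "processed %d (%d/s)" % (count, rate)


def timed_scan(rows, clock=time.time):
    """Consume an iterable of (key, columns) pairs and report throughput.

    The clock argument exists only so the timing source can be replaced
    in tests; by default wall-clock time is used.
    """
    start = clock()
    count = 0
    for _key, _columns in rows:
        count += 1
    return format_progress(count, clock() - start)
```

Under these assumptions, the test configuration above would be measured with something like timed_scan(cf.get_range(...)), where cf is the pycassa.ColumnFamily instance.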
Things to try:
Use a batch mutator for small inserts (FirstError, etc).
Cache the entire contents of ColumnFamilies read from inside the get_range() loop.
Split the iteration across multiple threads or processes, each covering one key range: get_range(start='', finish='b'), get_range(start='b', finish='c'), ...
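Two of the ideas above can be sketched without a running cluster. The helper names key_ranges and load_cache are hypothetical. The sketch assumes that the cluster's partitioner makes start/finish key ranges meaningful, and that the cached column family is small enough to fit in memory; neither has been verified here.

```python
def key_ranges(boundaries):
    """Turn boundary keys into (start, finish) pairs covering the key space.

    An empty string means "unbounded" on that side, matching the
    get_range(start='', finish='b') form used in the list above.
    """
    edges = [''] + list(boundaries) + ['']
    return list(zip(edges[:-1], edges[1:]))


def load_cache(rows):
    """Read an iterable of (key, columns) pairs into a dict once, so that
    lookups inside the main loop do not hit the database again."""
    return dict(rows)


print(key_ranges(['b', 'c']))
# [('', 'b'), ('b', 'c'), ('c', '')]
```

Each pair could then be handed to its own worker, which calls get_range(start=start, finish=finish). If the finish key is treated as inclusive, a row whose key equals a boundary may be returned by two neighbouring ranges, so workers may need to de-duplicate. load_cache could be fed directly from another column family's get_range() before the main loop starts.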
ErrorTracker/CassandraQueries (last edited 2013-06-13 10:55:13 by ev)