CassandraQueries
At the time of writing (April 16th, 2013), there are nearly 45 million reports in the database (44,799,616).
Iterating over the OOPS column family alone (where the reports live), with no other lookups and all data passing through a single node, takes approximately 10 hours:
45 million reports / ~1,285 reports per second = 35,019 seconds (9.7 hours).
On production (finfolk) it takes approximately 9.4 hours (1,330 reports per second).
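The arithmetic above can be checked directly (the 9.4-hour finfolk figure follows the same way at 1,330 rows per second):

```python
# Back-of-the-envelope check of the quoted iteration time.
reports = 45_000_000        # ~44,799,616 reports as of April 2013
rows_per_second = 1285      # measured single-node get_range() throughput

seconds = reports / rows_per_second
hours = seconds / 3600
print(round(seconds), round(hours, 1))  # → 35019 9.7
```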
Test configuration:
{{{
pycassa.ColumnFamily(pool, 'OOPS').get_range(columns=['SystemIdentifier', 'DistroRelease'], buffer_size=10 * 1024, include_timestamp=True)
}}}
 * Going up to {{{buffer_size=20 * 1024}}} slows it down to 1,015 rows per second.
||From BlueFin ||From finfolk ||
||processed 61137 (933/s) ||processed 82398 (1373/s) ||
Things to try:
 * Use a batch mutator for small inserts (FirstError, etc).
 * Cache the entire contents of Column``Families read from inside the get_range() loop.
 * Multithreaded / multiprocess on get_range(start='', finish='b'), get_range(start='b', finish='c'), ...
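The last idea can be sketched in pure Python: carve the key space into (start, finish) windows and hand one window to each worker. This is a sketch only; the single-letter boundaries are illustrative, and with RandomPartitioner the real splits would be token ranges rather than key prefixes.

```python
import string

def range_splits(boundaries):
    """Build (start, finish) pairs covering '' .. '' via the given boundaries."""
    edges = [''] + list(boundaries) + ['']
    return list(zip(edges, edges[1:]))

splits = range_splits(string.ascii_lowercase)
# splits[0] == ('', 'a'): everything before 'a'
# splits[-1] == ('z', ''): everything from 'z' onward
# Each pair would then go to one worker process, which calls
# cf.get_range(start=start, finish=finish)  (hypothetical worker loop)
```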
== Additional scripts ==
 * tools/build_src_version_buckets.py processed 1549190 buckets at a rate of 1359/min when run from a retracer.
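At that rate, a full pass over the buckets works out to roughly 19 hours:

```python
# Time for one full pass at the measured bucket-processing rate.
buckets = 1_549_190
per_minute = 1359
hours = buckets / per_minute / 60
print(round(hours, 1))  # → 19.0
```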
ErrorTracker/CassandraQueries (last edited 2013-06-13 10:55:13 by ev)