Diff for "ErrorTracker/CassandraQueries"

Differences between revisions 5 and 7 (spanning 2 versions)

At the time of writing (April 16th, 2013), there are nearly 45 million reports in the database (44,799,616).

Iterating over the OOPS column family alone (where the reports live), with no other lookups and with all of the data passing through a single node, takes approximately 10 hours.

45 million reports / ~1,285 reports per second = 35,019 seconds (9.7 hours).

On production (finfolk) it takes approximately 9.4 hours (1,330 reports per second).
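The estimates above are simply the report count divided by the observed throughput. A minimal sketch of that arithmetic, using the rounded figures quoted on this page (the function name is illustrative only):

```python
def scan_duration(total_reports, reports_per_second):
    """Return (seconds, hours) needed to iterate total_reports at a steady rate."""
    seconds = total_reports / float(reports_per_second)
    return seconds, seconds / 3600.0


# 45 million is the rounded report count used in the estimate above.
for label, rate in (("~1,285/s", 1285), ("finfolk, ~1,330/s", 1330)):
    seconds, hours = scan_duration(45000000, rate)
    print("%s: %.0f seconds (%.1f hours)" % (label, seconds, hours))
```

The same function can be re-run with the exact count (44,799,616) or with a newer throughput figure as the database grows.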

Test configuration:

pycassa.ColumnFamily(pool, 'OOPS').get_range(columns=['SystemIdentifier', 'DistroRelease'], buffer_size=10 * 1024, include_timestamp=True)
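A minimal sketch of how such an iteration could be timed to produce "processed N (rate/s)" lines like the ones on this page. The helper name, the reporting interval and the injectable clock are assumptions for illustration, not the script that was actually used. pycassa is deliberately not imported, so the sketch runs without a cluster.

```python
import time


def iterate_with_progress(rows, report_every=60.0, clock=time.time, out=print):
    """Consume rows, periodically reporting the running count and average rate.

    rows is any iterable; report_every is in seconds. Returns the total count.
    """
    start = clock()
    next_report = start + report_every
    processed = 0
    for _row in rows:
        processed += 1
        now = clock()
        if now >= next_report:
            elapsed = now - start
            # Average rate since the start of the scan, not since the last report.
            out("processed %d (%d/s)" % (processed, processed / elapsed))
            next_report = now + report_every
    return processed


# Hypothetical use against the test configuration above:
#   cf = pycassa.ColumnFamily(pool, 'OOPS')
#   iterate_with_progress(cf.get_range(
#       columns=['SystemIdentifier', 'DistroRelease'],
#       buffer_size=10 * 1024, include_timestamp=True))
```

Reporting the average since the start matches the way the figures on this page settle towards a steady value as the scan progresses.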

From BlueFin:

processed 61137 (933/s)
processed 132453 (1004/s)
processed 213929 (1101/s)
processed 295449 (1157/s)
processed 376930 (1186/s)
processed 468641 (1216/s)
processed 550162 (1229/s)
processed 631666 (1239/s)
processed 713185 (1246/s)
processed 794678 (1254/s)
processed 876226 (1261/s)
processed 957740 (1266/s)
processed 1039262 (1268/s)
processed 1120755 (1271/s)
processed 1202274 (1270/s)
processed 1293981 (1276/s)
processed 1375494 (1278/s)
processed 1456979 (1280/s)
processed 1538473 (1283/s)
processed 1620010 (1285/s)
processed 1701522 (1286/s)
processed 1783011 (1284/s)
processed 1864538 (1283/s)
processed 1946040 (1284/s)
processed 2037762 (1287/s)
processed 2119265 (1287/s)
processed 2190588 (1283/s)
processed 2272100 (1284/s)
processed 2353586 (1282/s)
processed 2435095 (1282/s)
processed 2516581 (1280/s)

From finfolk:

processed 82398 (1373/s)
processed 163023 (1341/s)
processed 254688 (1344/s)
processed 336191 (1332/s)
processed 427882 (1342/s)
processed 509401 (1334/s)
processed 590923 (1335/s)
processed 662248 (1316/s)
processed 743751 (1317/s)
processed 835453 (1324/s)
processed 927166 (1335/s)
processed 1008680 (1335/s)
processed 1100397 (1342/s)
processed 1181903 (1338/s)
processed 1263406 (1338/s)
processed 1344915 (1339/s)
processed 1426415 (1336/s)
processed 1518115 (1336/s)
processed 1609819 (1339/s)
processed 1701522 (1339/s)
processed 1793209 (1342/s)
processed 1884909 (1344/s)

Things to try:

Use a batch mutator for small inserts (FirstError, etc).
- http://pycassa.github.io/pycassa/api/pycassa/batch.html
Multithreaded / multiprocess on get_range(start='', finish='b'), get_range(start='b', finish='c'), ...
- This will not work with RandomPartitioner, which orders rows by the MD5 hash of the key rather than by the key itself:
  - https://github.com/pycassa/pycassa/issues/170
  - http://stackoverflow.com/questions/14730137/cassandra-randompartitioner-and-full-table-scans
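A sketch of the batch mutator idea from the list above. It assumes the object passed in is a pycassa ColumnFamily, whose batch() method returns a mutator that can be used as a context manager. The function name and the row layout are hypothetical, and pycassa is not imported so the snippet can be exercised without a cluster.

```python
def insert_batched(column_family, rows, queue_size=100):
    """Queue small inserts and let the mutator send them in batches.

    column_family is expected to behave like a pycassa.ColumnFamily;
    rows is an iterable of (row_key, {column_name: value}) pairs.
    Returns the number of rows queued.
    """
    count = 0
    # The mutator sends whenever queue_size mutations have been queued,
    # and sends whatever remains when the with-block exits.
    with column_family.batch(queue_size=queue_size) as batch:
        for key, columns in rows:
            batch.insert(key, columns)
            count += 1
    return count


# Hypothetical use for a small column family such as FirstError:
#   first_error = pycassa.ColumnFamily(pool, 'FirstError')
#   insert_batched(first_error, pending_rows, queue_size=200)
```

Whether this helps depends on how many small inserts are made per report; it reduces round trips but does not change the cost of the range scan itself.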
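To illustrate why splitting the scan by key prefix does not work, the sketch below compares key order with token order. It assumes the cluster uses RandomPartitioner and approximates its token as the absolute value of the key's MD5 digest read as a signed 128-bit integer. The sample keys are made up.

```python
import hashlib


def random_partitioner_token(key):
    """Approximate a RandomPartitioner token: |MD5(key)| as a signed 128-bit int."""
    digest = hashlib.md5(key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


# Made-up row keys, purely for illustration.
keys = [("oops-%03d" % i).encode("ascii") for i in range(50)]
by_key = sorted(keys)
by_token = sorted(keys, key=random_partitioner_token)

# Rows come back from a range scan in token order, so a lexical slice of
# the key space (start='', finish='b') is not a contiguous part of the ring.
print(by_key[:5])
print(by_token[:5])
```

Any parallel scan would therefore have to be divided by token range rather than by key range, which is the limitation discussed in the links above.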

ErrorTracker/CassandraQueries (last edited 2013-06-13 10:55:13 by ev)

Revision 5 as of 2013-04-16 10:39:04 (Size: 2520, Editor: ev)
Revision 7 as of 2013-04-16 13:23:49 (Size: 2757, Editor: ev)
