My Computing Note: Running YCSB over Apache Cassandra

1. Introduction

Apache Cassandra is one of an open source key-value storage that utilizes application's in-memory cache data structure to store data. It is inspired by Google's Big Table and Amazon's Dynamo. To benchmark Cassandra, one of the common benchmarking tools is YCSB.

2. Setup

Installing cassandra and YCSB are as simple as un-tarring downloaded files. Instructions can be found here. I describe some updates from the referred web site, which stems from updates in Cassandra 1.2.0.

For creating keyspace in Cassandra 1.2.0 or later Cassandra 0.8.0

A proper syntax to create the keyspace is:
create keyspace usertable with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = [{replication_factor:1}];

Others remain the same from the link.

3. Run YCSB

A good guide from Brian Kooper is here.

So far, we discussed a setup from scratch. Cassandra is not very complex to get things work in general, but you may experience with some troubles when you want to re-create Cassandra cluster. Some of random collections that I faced were summarized in the remainder of this section. And it will grow as I face more troubles.

3.1 Trouble shootings

Adding nodes to cluster outputs 'no other nodes seen' message.

This could be party because your prior installation may use different seed node list in cassandra.yaml and accordingly, your system information, which will be referred by default when you re-launch Cassandra. A solution is clear all files from /var/lib/cassandra/data/system or accordingly you configured before.

How much traffic can I generate from a certain number of YCSB client threads?

In my experience, when the number of threads exceeds some decent numbers like 128 and fixed number of machines are used for generating workloads, the overall maximally achievable throughput without throughput throttling will be flat with out the throughput target option (-target ). Before running actual experiments, you should confirm what is the maximally achievable throughput number with your client machines. Otherwise, your experimental results would be bounded by the clients instead of Cassandra servers.

My Computing Note

Search This Blog

Tuesday, January 22, 2013

Running YCSB over Apache Cassandra