Cassandra Summit 2015 Key Takeaways

In this post I want to share some key takeaways from recent Apache Cassandra Summit in Santa Clara.
In general I was pleased to see that Cassandra is getting more and more acceptance in the industry. I believe now it can be considered as a mainstream technology and large scale deployment of Cassandra clusters (> than 75000 nodes) is a prove of that.
In this post I will not talk about particular talks I found interested (I hope to dedicate a separate post for that) but I rather want to write about interesting technologies mentioned during the summit:

  1. One of  the most exiting technologies I learned about during the summit is ScylaDB – an open-source C++14 implementation of Cassandra. It uses the latest advancements in C++, OS kernels and hardware drivers to deliver very impressive performance. ScyllaDB developers claim that they can achieve 10x performance over Cassandra in some benchmarks and obviously no more issues with JVM performance and Garbage Collection. It is especially beneficial for large servers with lots of memory and CPU that is not used efficiently by regular Cassandra.I plan to write a dedicated post about ScyllaDB with some benchmarks in the nearest future. Link to github for those who is eager to check it.
  2. Almost all talks mention Cassandra with Apache Spark spark deployment. Looks like it is becoming a standard deployment now. Many groups are also adding Apache Kafka to this list.
  3. I was surprised to learn that DateTieredCompactionStrategy (DTCS) has many issues even though we encountered some of them our-self recently. More information about the issues and proposed solutions you can find in this Jira ticket.
  4. Stratio – full text search in Cassadra based on Apache Lucene.
  5. Stargate – another search based on Lucene. Though my understanding is that  Stratio is a more favorable solution by now.
  6. Apache Zeppelin – not really related to Cassandra but very useful tool.
  7. Presto – distributed SQL query engine that can run on top of Cassandra or Hive.

 

IN predicate in Cassandra CQL

Recently I was writing some CQL queries with IN predicate for Apache Cassandra and was surprised that this topic is not covered well on the Web. Here is some information that I found, mainly based on official documentation and this post by Ryan Svihla.
When you are using IN statement with partition key the coordinator node essentially transforms your request into multiple requests with single partition key each and sends them to corresponding nodes. If you have a large set in you IN predicate that can put a heavy load on coordinator node and in general will be slower than multiple individual requests directly from the client. That is kind of counterintuitive for those who came from traditional SQL world. In addition, if coordinator node goes down or times out you have to replay the whole requests rather than individual sub-requests in case of multiple requests from client. Thus don’t use IN statement with partition keys – use multiple async requests.
However if you are using IN statement on a clustering key within one partition that can actually improve performance because the query will be executed only on one node. In addition it would require seek on ordered elements stored continuously on a disk.
I will update the post with relevant benchmarks later.