Embedded HTTP servers with Scala

Recently I was looking for an embedded HTTP server that I could use from a Scala app. Surprisingly, there is not much information on the web, and most of it is outdated. The best source I found is this question on StackOverflow.
Here is a summary of what I found:

  1. Jetty – one of the easiest ways to start, and well known to Java developers.
  2. Scalatra – a very nice framework, lots of options, easy to use.
  3. Spray – looks like a very cool project, but way too complicated for Scala beginners.
  4. http4s – another promising interface for HTTP.
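
To make "embedded" concrete, here is a minimal sketch that uses none of the libraries above. It relies on the small HTTP server that ships with the JDK (com.sun.net.httpserver), so it needs no extra dependencies. The object name EmbeddedServer, the /hello path and the response text are made up for illustration.

```scala
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.net.InetSocketAddress
import java.nio.charset.StandardCharsets

object EmbeddedServer {
  /** Starts a server on the given port (0 means "any free port") and returns it. */
  def start(port: Int): HttpServer = {
    val server = HttpServer.create(new InetSocketAddress(port), 0)
    // Illustrative endpoint: every request to /hello gets a fixed plain-text body.
    server.createContext("/hello", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        val body = "Hello from Scala".getBytes(StandardCharsets.UTF_8)
        exchange.sendResponseHeaders(200, body.length.toLong)
        val out = exchange.getResponseBody
        try out.write(body) finally out.close()
      }
    })
    server.start()
    server
  }
}
```

Calling EmbeddedServer.start(8080) from your app and opening http://localhost:8080/hello should return the greeting; server.stop(0) shuts it down. The frameworks listed above add routing, JSON handling and so on, on top of this basic idea.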

Cassandra Summit 2015 Key Takeaways

In this post I want to share some key takeaways from the recent Apache Cassandra Summit in Santa Clara.
In general, I was pleased to see that Cassandra is gaining more and more acceptance in the industry. I believe it can now be considered a mainstream technology, and large-scale deployments of Cassandra clusters (more than 75,000 nodes) are proof of that.
In this post I will not cover the particular talks I found interesting (I hope to dedicate a separate post to that); instead I want to write about interesting technologies mentioned during the summit:

  1. One of the most exciting technologies I learned about during the summit is ScyllaDB – an open-source C++14 implementation of Cassandra. It uses the latest advancements in C++, OS kernels and hardware drivers to deliver very impressive performance. ScyllaDB developers claim that they can achieve 10x the performance of Cassandra in some benchmarks, and since it does not run on the JVM, there are no more issues with JVM performance and garbage collection. It is especially beneficial for large servers with lots of memory and CPU that regular Cassandra does not use efficiently. I plan to write a dedicated post about ScyllaDB with some benchmarks in the near future. Link to GitHub for those who are eager to check it out.
  2. Almost all talks mentioned deploying Cassandra together with Apache Spark. It looks like this is becoming a standard deployment now. Many groups are also adding Apache Kafka to the list.
  3. I was surprised to learn that DateTieredCompactionStrategy (DTCS) has many issues, even though we encountered some of them ourselves recently. You can find more information about the issues and proposed solutions in this Jira ticket.
  4. Stratio – full-text search in Cassandra based on Apache Lucene.
  5. Stargate – another search solution based on Lucene, though my understanding is that Stratio is the preferred solution by now.
  6. Apache Zeppelin – not really related to Cassandra, but a very useful tool.
  7. Presto – distributed SQL query engine that can run on top of Cassandra or Hive.

 

Setting up Scala development environment on Ubuntu

This is a simple step-by-step guide on how to install a Scala development environment on Ubuntu.

  1. Install the Java SDK. Here is a simple guide on how to install Java using apt-get.
  2. Install SBT. Either download sbt manually from here and unpack it, or use these instructions to get it with apt-get:
    echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
    sudo apt-get update
    sudo apt-get install sbt
  3. If you plan to use Eclipse for Scala development, you can download a preconfigured version from here.
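
To check that the toolchain works after these steps, a tiny program like the sketch below can be compiled and run with sbt (for example with `sbt run` inside a project directory). The file name Hello.scala, the object name and the message are placeholders.

```scala
// Hello.scala: smallest possible program to confirm that sbt can compile and run Scala.
object Hello {
  // Kept as a separate value so it can also be checked from a test or the REPL.
  val greeting: String = "Scala toolchain works"

  def main(args: Array[String]): Unit =
    println(greeting)
}
```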

IN predicate in Cassandra CQL

Recently I was writing some CQL queries with the IN predicate for Apache Cassandra and was surprised that this topic is not covered well on the Web. Here is some information that I found, mainly based on the official documentation and this post by Ryan Svihla.
When you use an IN predicate on the partition key, the coordinator node essentially transforms your request into multiple requests, each with a single partition key, and sends them to the corresponding nodes. If the IN predicate contains a large set of keys, that can put a heavy load on the coordinator node and in general will be slower than multiple individual requests issued directly from the client. That is somewhat counterintuitive for those who come from the traditional SQL world. In addition, if the coordinator node goes down or times out, you have to replay the whole request, whereas with multiple requests from the client you replay only the individual sub-requests that failed. Thus, don’t use IN with partition keys; use multiple async requests instead.
However, if you use an IN predicate on a clustering key within a single partition, it can actually improve performance, because the query is executed on only one node. In addition, it requires only a seek over ordered elements stored contiguously on disk.
I will update the post with relevant benchmarks later.
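
The advice on partition keys can be sketched on the client side as follows. This is only an illustration under assumptions: fetchPartition is a hypothetical stand-in for whatever async call your Cassandra driver provides, and the table and column names in the comments are made up.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object FanOut {
  // Hypothetical stand-in for an async driver call that runs
  //   SELECT * FROM events WHERE id = ?
  // for a single partition key. Replace the body with a real client call.
  def fetchPartition(id: Int)(implicit ec: ExecutionContext): Future[String] =
    Future(s"row-$id")

  // Instead of one query with  WHERE id IN (1, 2, 3),  issue one request per key.
  // Each key is retried once on its own, so a single failure does not force
  // a replay of the whole set.
  def fetchAll(ids: Seq[Int])(implicit ec: ExecutionContext): Future[Seq[String]] =
    Future.sequence(ids.map { id =>
      fetchPartition(id).recoverWith { case NonFatal(_) => fetchPartition(id) }
    })
}
```

With an implicit ExecutionContext in scope, FanOut.fetchAll(Seq(1, 2, 3)) returns a single Future that completes once all per-key requests have finished, with the results in the same order as the keys.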

Creating a new disk in Linux

# fdisk /dev/vdc
Command (m for help): n
Partition type:
p primary (0 primary, 0 extended, 4 free)
e extended
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-209715199, default 2048):

Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-209715199, default 209715199):
Using default value 209715199

Command (m for help): w
The partition table has been altered!

# mkfs.ext4 /dev/vdc1
# mount /dev/vdc1 /database
# df -h