Track Uno

Having some time on my hands over the weekend, I decided to try out GarageBand. The last time I tried to compose music was almost a decade ago.

I have not used any live instruments for this one, though I hope to give that a shot in the future. I feel pretty satisfied with the result, given that I spent only about 3 hours on it. I hope you like it too.

Evolution of Hadoop support in Cassandra

Below is a compilation of all changes that were made in the Cassandra code base related to Hadoop support. The source for this compilation is http://svn.apache.org/repos/asf/cassandra/trunk/CHANGES.txt. I have tried my best to avoid any misses or mistakes. In case you notice something amiss, please drop in a comment and I will fix it.

1.0.1

  • Skip empty rows when slicing the entire row (CASSANDRA-2855)
  • Make CFIF try rpc_address or fallback to listen_address (CASSANDRA-3214)
  • Accept comma delimited lists of initial thrift connections (CASSANDRA-3185)

0.8.7

0.8.5

  • Fail jobs when Cassandra node has failed but TaskTracker has not (CASSANDRA-2388)

0.8.3

0.8.2

0.8.1

  • Fix race that could result in Hadoop writer failing to throw an exception encountered after close() (CASSANDRA-2755)

0.8.0

0.7.5

  • Allow job configuration to set the CL used in Hadoop jobs (CASSANDRA-2331)

0.7.3

  • Fix Hadoop ColumnFamilyOutputFormat dropping of mutations when batch fills up (CASSANDRA-2255)

0.7.1

0.7.0-rc2

  • Support multiple Mutations per key in hadoop ColumnFamilyOutputFormat (CASSANDRA-1774)

0.7-beta2

  • Remove cassandra.yaml dependency from Hadoop and Pig (CASSANDRA-1322)
  • Support for Hadoop Streaming [non-jvm map/reduce via stdin/out] (CASSANDRA-1368)
  • Rewrite Hadoop ColumnFamilyRecordWriter to pool connections, retry to multiple Cassandra nodes, and smooth impact on the Cassandra cluster by using smaller batch sizes (CASSANDRA-1434)

0.7-beta1

0.6.4

0.6.2

  • Fix SlicePredicate serialization inside Hadoop jobs (CASSANDRA-1049)
  • Close Thrift sockets in Hadoop ColumnFamilyRecordReader (CASSANDRA-1081)

0.6.1

  • Use hostnames in CFInputFormat to allow Hadoop’s naive string-based locality comparisons to work (CASSANDRA-955)

0.6.0-beta3

0.6.0-beta1/beta2

 

Using your own URL Shortener in Gwibber

After starting to use my own URL shortener service (see http://url.jairam.me), the next thing I wanted to do was integrate it into Gwibber (see http://www.gwibber.com). It turns out to be really simple. I would advise you to do this while Gwibber is not running.

Assumptions

  • This was tested on Ubuntu 11.04 and the instructions are written for Ubuntu (other Debian-based distros should be similar).
  • The version of Gwibber this was tried against was 3.0.0.1.

Step 1 : Create your protocol file

First of all, choose your URL shortener service. I will use my own URL shortener as the example. To keep it secure, I have obfuscated some parameters. If you want to use my service, just drop me an email or tweet. You are always welcome to use the web interface at http://url.jairam.me (I know the name is not creative. Suggestions are most welcome).

What you essentially need is a way of making an API call to your service. Once you know the format of the call, create a file, say urljairamme.py, whose content should look like this -

  1 """
  2
  3 url.jairam.me interface for Gwibber
  4 jairamc (Jairam Chandar) - 2011-09-30
  5
  6 """
  7
  8 import urllib2
  9
 10 PROTOCOL_INFO = {
 11
 12   "name": "url.jairam.me",
 13   "version": 0.1,
 14   "fqdn" : "http://url.jairam.me",
 15
 16 }
 17
 18 class URLShorter:
 19
 20   def short(self, text):
 21     short = urllib2.urlopen("http://url.jairam.me/yourls-api.php?signature=xxxxxxxxxx&action=shorturl&format=simple&url=%s" % urllib2.quote(text)).read()
 22     return short

The main line to notice is line number 21. Replace the text inside the quotes with a suitable API call for the service you want and save the file. For instance, if you wanted to create a bit.ly service for Gwibber (currently not supported), the call would look like this (of course, there is a little more voodoo involved with the bit.ly API) -

http://api.bitly.com/v3/shorten?login=bitlyapidemo&apiKey=R_0da49e0a9118ff35f52f629d2d71bf07&longUrl=http%3A%2F%2Fbetaworks.com%2F&format=txt
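To make the shape of such an API call concrete, here is a minimal sketch that assembles the same query string as line 21 of the protocol file. It is written in Python 3 with urllib.parse, not the Python 2 urllib2 module that Gwibber 3.0 uses, so treat it as an illustration and not a drop-in replacement. The function name build_shorten_url is hypothetical, the signature value is a placeholder, and nothing is sent over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_shorten_url(long_url, signature="xxxxxxxxxx",
                      endpoint="http://url.jairam.me/yourls-api.php"):
    """Build the GET URL for the shorten call (no request is made).

    The parameter names mirror line 21 of the protocol file; the
    signature default is a placeholder, not a working key.
    """
    params = {
        "signature": signature,
        "action": "shorturl",
        "format": "simple",
        "url": long_url,
    }
    # urlencode percent-encodes the long URL so it survives as one parameter.
    return endpoint + "?" + urlencode(params)


api_url = build_shorten_url("http://betaworks.com/")
print(api_url)

# Round-trip check: the long URL can be recovered from the query string.
recovered = parse_qs(urlsplit(api_url).query)["url"][0]
print(recovered)
```

Building the query with urlencode instead of string formatting keeps the escaping in one place, which matters once the long URL itself contains "&" or "=" characters.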

Step 2 : Place the URL protocol in correct location

Move the above protocol file to “/usr/share/pyshared/gwibber/microblog/urlshorter/” and then create a symlink to it from “/usr/lib/python2.7/dist-packages/gwibber/microblog/urlshorter”. The paths assume Python 2.7; adjust them if your installed Python version is different.

sudo mv urljairamme.py /usr/share/pyshared/gwibber/microblog/urlshorter/
sudo ln -s /usr/share/pyshared/gwibber/microblog/urlshorter/urljairamme.py /usr/lib/python2.7/dist-packages/gwibber/microblog/urlshorter/urljairamme.py

Step 3 : Make Gwibber aware of your protocol file

Edit the __init__.py file (see location below) and add your new URL shortener service.

sudo vi /usr/lib/python2.7/dist-packages/gwibber/microblog/urlshorter/__init__.py

The file should look like this -

  1
  2 import cligs, isgd, tinyurlcom, ur1ca, urljairamme
  3 #import snipurlcom, zima
  4
  5 PROTOCOLS = {
  6   "cli.gs": cligs,
  7   "is.gd": isgd,
  8   #"snipurl.com": snipurlcom,
  9   "tinyurl.com": tinyurlcom,
 10   "ur1.ca": ur1ca,
 11   "url.jairam.me": urljairamme,
 12   #"zi.ma": zima,
 13 }

Notice lines 2 and 11. These are the new/edited lines in the file for the new service.

Step 4 : Change preferences in Gwibber to the new URL shortener service

  1. Open Gwibber
  2. Edit -> Preferences -> Messages -> Advanced -> Select the new service
And you are done.

Eclipse – How to get the exact command it executes on Run

While trying to run a mixed Scala/Java project, I ran into a problem: Eclipse could launch my program successfully, but when I tried to launch the same program from the command line, I faced one problem after another. After a lot of search-fix-find-new-problem cycles, I decided to find out exactly what command Eclipse was launching. Google came to my rescue, and I found this thread: http://stackoverflow.com/questions/1989419/eclipse-is-there-a-way-to-get-eclipse-to-output-the-commands-given-to-run-your-p

Just wanted to jot down the steps here again (these are reproduced as is from the above thread) -

  1. Run your program inside Eclipse.
  2. Go to the Debug perspective.
  3. Terminate the program, or let it end. Right-click on the second line (Terminated, exit value…) and select Properties. There you will have the full command line used.

ZeroMQ Java Binding – Subscriber not receiving messages from Publisher

While trying out the Java bindings for ZeroMQ, I came across this problem. Basically, the subscriber was not receiving messages from the publisher. After a lot of meddling around, it turned out that the publisher was trying to send messages even before the socket binding had completed. Weird behaviour, but one that could be easily avoided by just putting in a sleep of 2 seconds before you start publishing messages. (The ZeroMQ guide calls this the “slow joiner” symptom: a publisher silently drops messages that are sent before a subscriber has finished connecting and subscribing.)

Read more about ZeroMQ at http://www.zeromq.org/

 

Scala in Eclipse using Maven – M2Eclipse connector

While playing around with building Scala projects using Maven in Eclipse, I ran into a few problems. I solved these using the steps mentioned below.

Assumptions -

When creating your pom.xml file, you will run into the following error -

No marketplace entries found to handle maven-scala-plugin:2.15.2:compile in Eclipse.  Please see Help for more information.

And/Or

Plugin execution not covered by lifecycle configuration: org.scala-tools:maven-scala-plugin

The reason for the above error is that you need to install the m2e-scala connector, which still does not seem to have made it into the mainstream M2E marketplace (you will encounter this while installing M2E or creating/checking out a Maven project). To install this connector, add http://alchim31.free.fr/m2e-scala/update-site/ to your Available Software Sites in Eclipse and install the connector by following the installation dialog. And you are done.

Hive on Amazon EMR

There are quite a few resources out there that can help you with running Hive on Amazon EMR. I decided to write this more as a reference for myself than anything else. But I do hope it helps people out there.

Please note that these instructions are for :

  • A Linux machine. I expect them to be quite similar for a Mac or for Windows (with a Linux API layer like Cygwin).
  • Using Amazon EMR via the command line. There are other ways you can use EMR, like Amazon’s web interface.

Setting up Amazon EMR

Step 1 : Create an AWS account and enable it for Amazon Elastic Map-Reduce.

What you should expect to get out of this step are the following -

  • Access-id
  • Private Key
  • Key-Pair file (your private key to ssh) with a key-pair name (which you would have given at the time of creating the account)

This step will require authentication and verification (you can do it over the phone) with Amazon.

Step 2 : Install dependencies

sudo apt-get install ruby libopenssl-ruby

Step 3 : Download the Elastic Map Reduce Ruby client into a folder

mkdir emr
cd emr 
wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
unzip elastic-mapreduce-ruby.zip

Put the credentials (key-pair file) in the same folder as the elastic map reduce files. Create a file called credentials.json if it does not already exist in the same folder where you unzipped the ruby client.

The credentials.json file should look like this -

{
    "access-id": "xxxxxxxxxxxxxxxxxxxx",
    "private-key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "key-pair": "my-key-pair",
    "key-pair-file": "key-pair-file.pem",
    "log-uri": "s3://my-bucket/logs"
}
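Since a typo in this file is an easy way to lose time, a quick sanity check can help. The sketch below only confirms that the keys shown in the sample above are present. Which of them the Ruby client strictly requires is an assumption on my part, and the helper name missing_credential_keys is hypothetical.

```python
import json

# Keys taken from the sample credentials.json above. Whether the Ruby
# client treats every one of them as mandatory is an assumption.
EXPECTED_KEYS = {"access-id", "private-key", "key-pair",
                 "key-pair-file", "log-uri"}


def missing_credential_keys(text):
    """Return a sorted list of expected keys absent from the JSON text."""
    data = json.loads(text)
    return sorted(EXPECTED_KEYS - set(data))


sample = """{
    "access-id": "xxxxxxxxxxxxxxxxxxxx",
    "private-key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "key-pair": "my-key-pair",
    "key-pair-file": "key-pair-file.pem",
    "log-uri": "s3://my-bucket/logs"
}"""

print(missing_credential_keys(sample))
```

To check a real file, read it with open("credentials.json").read() and pass the text to the same function. An empty list means every expected key was found.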

And that’s it, you are ready to run Elastic Map-Reduce on Amazon. EMR instances have support for the following -

  • Hive
  • Pig
  • Custom Map-Reduce Jobs
  • Built-in capability to read from Amazon S3

Frequently Used Commands

All the commands below should be run from inside the folder where you unzipped the Ruby client.

List all the current jobs

./elastic-mapreduce --list

List all the current active jobs

./elastic-mapreduce --list --active

Get help/documentation

./elastic-mapreduce --help

Start a Hive instance

Interactive Mode

./elastic-mapreduce --create --name "${JOB_NAME}" \
     --hive-interactive --num-instances ${EMR_INSTANCES_NUM} \
     --master-instance ${EMR_INSTANCES_TYPE} --alive

This should print a job flow ID like “j-VENCHH7KKB32”. Choose the instance type and the number of machines carefully to balance performance against cost.[1] The command looks for the credentials file in the current folder, and there are options that you can use to override the defaults. See the EMR help (the --help command above) for documentation.

Script Mode

./elastic-mapreduce --create \
    --hive-script --args ${EMR_SCRIPT_PATH} \
    --args -d,OUTPUT_PATH=${OUTPUT_LOCATION_S3} \
    --name "${JOB_NAME}" \
    --num-instances ${EMR_INSTANCES_NUM} \
    --instance-type ${EMR_INSTANCES_TYPE} \
    --credentials ${EMR_CREDENTIALS_FILE}
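If you launch script-mode jobs from a scheduler or another program, it can be handy to assemble this command in code. The following is a minimal sketch that builds the argument list using only the flags shown in the command above. The function name hive_script_command and every value passed to it (bucket, script path, job name, instance type) are hypothetical placeholders, and nothing is executed.

```python
import shlex


def hive_script_command(script_path, output_path, job_name,
                        num_instances, instance_type, credentials_file):
    """Assemble the script-mode invocation as an argument list."""
    return [
        "./elastic-mapreduce", "--create",
        "--hive-script", "--args", script_path,
        "--args", "-d,OUTPUT_PATH=" + output_path,
        "--name", job_name,
        "--num-instances", str(num_instances),
        "--instance-type", instance_type,
        "--credentials", credentials_file,
    ]


# All values below are placeholders for illustration only.
cmd = hive_script_command(
    script_path="s3://my-bucket/scripts/report.q",
    output_path="s3://my-bucket/output",
    job_name="nightly report",
    num_instances=3,
    instance_type="m1.small",
    credentials_file="credentials.json",
)

# Print a shell-safe rendering; shlex.quote protects the space in the name.
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list means it can be handed to subprocess.run(cmd) from inside the client folder without worrying about shell quoting.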

Logging into your Hive instance

 ./elastic-mapreduce --ssh 'j-VENCHH7KKB32'

Once you are logged in, you might want to install screen as any network glitch might kill your session.

 sudo apt-get install screen

Just type hive once you are logged in and you are good to run Hive queries.

Add nodes to currently running job instance

./elastic-mapreduce --add-instance-group TASK \
      --instance-count 4 --instance-type m2.4xlarge \
      --jobflow 'j-VENCHH7KKB32'

The above will add a new instance group – TASK.

UPDATE: Turns out that you can add nodes only to job instances that were started with at least 2 nodes – See http://url.jairam.me/2a

There are three different kinds of instance groups -

  • Master
    • Manages the job flow. Coordinates the distribution of the MR executable and subsets of the raw data, to the core and task instance groups. [2]
  • Core
    • Contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node.[2]
  • Task
    • Contains all of the task nodes in a job flow. The task instance group is optional. You can add it when you start the job flow or add a task instance group to a job flow in progress.[2]

If you want to add more machines to an existing instance group, use the command below -

./elastic-mapreduce --modify-instance-group TASK \
      --instance-count 4 --instance-type m2.4xlarge \
      --jobflow 'j-VENCHH7KKB32'

Terminating a job instance

./elastic-mapreduce --terminate 'j-VENCHH7KKB32'

Useful Links

Fix skype for linux

Update : Looks like someone at Skype realized that they had not put in instructions for Linux and has updated their page. It’s the same as below.

A lot of us have been facing this issue where Skype seems to just kill itself after you log in. While Skype issued a temporary fix this morning for Windows and Mac, they have conveniently forgotten to mention a solution for their Linux client.

Based on their solution for Windows, I managed to fix it for Linux as well. Just quit Skype and do the following -

rm ~/.Skype/shared.xml

And voila! Restart Skype and it should work fine.

Cheers!

Introduction to Brisk

DataStax recently came out with a *distribution* (I am not completely sure I want to call it a distribution) called Brisk. It is a slick combination of Cassandra's tested real-time response and Hadoop's big data analytical capabilities.

Dave Gardner, my colleague at VisualDNA, and I decided to give it a run. We were truly impressed with its usefulness and ease of use. Dave gave a talk introducing Brisk at the CassandraLondon meetup. Here are the slides -

You can watch the video pod-cast of the presentation here.

The source code used in the demo can be found here.

If you decide to give it a try, please do share your feedback. You can reach me on twitter or via email at contact<at>jairam<dot>me