Install Spark/Shark on CDH 4

CDH 4 is the current stable version of the Cloudera Distribution of Hadoop.
Apache Spark is a fast and general engine for large-scale data processing.
Shark is a Hive-compatible query engine based on Spark.

Cloudera provides an official parcel for Apache Spark; you can also get it from my Cloudera mirror (only if you’re on CentOS/RHEL 6 x86_64).


My environment:

  • CentOS 6.4 x86_64, host names hadoop1-hadoop5;
  • Cloudera Manager 4.8.1;
  • CDH 4.5.0.

Install Spark

Cloudera already provides documentation about installing the Spark parcel, Installing Spark with Cloudera Manager, but only for CDH 4 + CM 4.

Install the Spark Parcel

Cloudera Manager Admin Console => Administration => Settings => Parcels.
If you’re using the official repo, use its parcel repository URL; if you’re using my mirror, use mine.
Save the settings, go to the Parcels page at the top right of CM, and install the parcel.

Configure Spark

I’m using my hadoop5 machine as the master, so I SSH in as root and edit /etc/spark/conf/
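Roughly, the settings involved look like this (a minimal sketch only; I’m assuming the parcel ships the usual spark-env.sh with the standard standalone variables, so check the file you actually have):

    # /etc/spark/conf/spark-env.sh -- sketch; variable names may differ per parcel version
    export STANDALONE_SPARK_MASTER_HOST=hadoop5   # run the standalone master on hadoop5
    export SPARK_MASTER_PORT=7077                 # default standalone master port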

Leave others as default.
Then edit /etc/spark/conf/slaves like:
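With my hosts, that file simply lists the worker machines, one hostname per line (drop hadoop5 if you don’t want a worker running on the master):

    hadoop1
    hadoop2
    hadoop3
    hadoop4
    hadoop5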

Sync /etc/spark/conf/* to other nodes that have Spark deployed.
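One way to do that from the master:

    # push the Spark config from hadoop5 to the other nodes
    for h in hadoop1 hadoop2 hadoop3 hadoop4; do
        rsync -av /etc/spark/conf/ root@${h}:/etc/spark/conf/
    done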

Start & Stop
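The standalone scripts shipped with Spark can be used here; a sketch, assuming the usual sbin/ layout of Spark 0.9 inside the parcel:

    cd /opt/cloudera/parcels/SPARK/lib/spark
    ./sbin/start-master.sh     # on the master (hadoop5)
    ./sbin/start-slaves.sh     # starts workers listed in conf/slaves (needs root ssh keys)

    ./sbin/stop-slaves.sh      # stop the workers
    ./sbin/stop-master.sh      # stop the master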




Install Shark

I thought Shark was no more than a gateway, so I felt it was not necessary to install Shark on all servers, only on the one I was going to work on.
But this turned out to be wrong: see Troubleshooting below.

CDH 4 + Shark 0.9.0 pre-release

The latest release of Spark is 0.9.0. For it I tried shark-0.9.0-bin-hadoop2.tgz together with hive-0.11.0-bin.tgz, but that Shark build is a pre-release and does not work on CDH 4.
If you run the same experiment, the Shark command-line client gives you:

In /var/log/hadoop-hdfs/, you can read errors like:

Shark 0.9.0 only works with Scala 2.10.3, which is included in the Spark parcel.
To get Shark closer to working, you have to create symbolic links for Scala’s libraries:
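The idea is to make the parcel’s Scala jars visible where Shark expects a SCALA_HOME; a rough sketch only, since the jar locations inside the parcel are an assumption you should verify:

    # assumed location of the Scala jars inside the SPARK parcel -- verify on your install
    SPARK_JARS=/opt/cloudera/parcels/SPARK/lib/spark/lib
    mkdir -p /opt/scala-2.10.3/lib
    for jar in scala-library scala-compiler scala-reflect; do
        ln -s ${SPARK_JARS}/${jar}*.jar /opt/scala-2.10.3/lib/
    done
    export SCALA_HOME=/opt/scala-2.10.3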

And Shark 0.9.0 only works with hive-0.11.0-bin.tgz, so HIVE_HOME should point to your extracted hive-0.11.0-bin.tgz.
After these steps you get closer and closer, and then you see the error message above.

Let’s build a working Shark on CDH 4.

CDH 4 + Shark 0.8.1

Download the files and create the folders:
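Roughly like this (download URLs omitted; the tarball and directory names are placeholders for whatever Shark 0.8.1, Scala, and Hive builds you grab, so adjust them):

    mkdir -p /opt/shark && cd /opt/shark
    # download and unpack the shark, scala and hive tarballs here, then:
    ln -s shark-0.8.1-bin-hadoop2 shark                             # current working shark
    mkdir -p /opt/shark/shark/dep
    ln -s /opt/shark/<your-scala-dir> /opt/shark/shark/dep/scala    # scala for shark
    ln -s /opt/shark/<your-hive-dir>  /opt/shark/shark/dep/hive     # hive for shark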

Now we get:

  • /opt/shark: root for all Shark versions;
  • /opt/shark/shark: the current working Shark;
  • /opt/shark/shark/dep: dependencies for Shark;
  • /opt/shark/shark/dep/hive: Hive for Shark;
  • /opt/shark/shark/dep/scala: Scala for Shark.

And some environment variables from CDH, Spark & Shark:

  • HADOOP_HOME: /opt/cloudera/parcels/CDH/lib/hadoop
  • SPARK_HOME: /opt/cloudera/parcels/SPARK/lib/spark
  • SHARK_HOME: /opt/shark/shark
  • SCALA_HOME: $SHARK_HOME/dep/scala
  • CDH_HIVE_HOME: /opt/cloudera/parcels/CDH/lib/hive
  • HIVE_HOME: $SHARK_HOME/dep/hive/
  • HIVE_CONF_DIR: ???
  • MASTER: spark://

Configure $SHARK_HOME/conf/ with the variables above.
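Assuming the usual conf/shark-env.sh, the exports look roughly like this (the master URL below is my hadoop5 box on the default standalone port; that value is an assumption, use your own):

    # $SHARK_HOME/conf/shark-env.sh -- sketch only
    export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
    export SPARK_HOME=/opt/cloudera/parcels/SPARK/lib/spark
    export SHARK_HOME=/opt/shark/shark
    export SCALA_HOME=$SHARK_HOME/dep/scala
    export HIVE_HOME=$SHARK_HOME/dep/hive
    export HIVE_CONF_DIR=$SHARK_HOME/conf     # see "Hive Configuration" below
    export MASTER=spark://hadoop5:7077        # assumed master host and default port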

You must have noticed the ??? above; let’s figure them out one by one.

Hive Configuration

In CM/CDH, Hive configurations are managed by CM and can be deployed from the web console to all nodes that have the Hive Gateway installed.
But Shark’s official wiki tells a bit more about setting it up on CDH.
If I set HIVE_CONF_DIR to /etc/hive/conf or $CDH_HIVE_HOME/conf (which links $CDH_HIVE_HOME/conf => /etc/hive/conf => /etc/alternatives/hive-conf => /etc/hive/conf.cloudera.hive1), this error comes out:

The solution is:
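The gist is to give Shark its own Hive configuration under $SHARK_HOME/conf instead of pointing at the CM-managed files; for example:

    # copy the CDH-managed Hive config into Shark's own conf directory,
    # then edit the copy rather than the CM-managed files
    cp /etc/hive/conf/hive-site.xml $SHARK_HOME/conf/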

Then add the following lines into Shark’s hive-site.xml, according to the wiki:
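The lines point the default filesystem at your HDFS nameservice; which exact key the wiki uses you should double-check, but it is along these lines:

    <!-- inside <configuration> of $SHARK_HOME/conf/hive-site.xml;
         the key shown (fs.defaultFS, a.k.a. fs.default.name in older configs) is my assumption -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://nameservice1</value>
    </property>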

Change the nameservice1 to your name service or name node.
Then set HIVE_CONF_DIR to $SHARK_HOME/conf.

lz4 Support

My imported data are all compressed with lz4; when I tried to SELECT a row from a Hive table, I got:

Take a look at /etc/spark/conf/

which includes /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/.

Download the lz4 source code, compile it, and copy the resulting library to all nodes that have Spark installed.


Then add the corresponding library path setting to $SHARK_HOME/conf/ as well.
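What to add mirrors the Spark setting above: the native library directory, plus wherever you copied the lz4 library, has to be on the library path for Shark too. A sketch, assuming shark-env.sh and these variable names:

    # in $SHARK_HOME/conf/shark-env.sh -- sketch; variable names and paths are assumptions
    export SPARK_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
    export LD_LIBRARY_PATH=$SPARK_LIBRARY_PATH:$LD_LIBRARY_PATH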

Restart Spark master and slaves.

Troubleshooting

Unfortunately, Shark cannot connect to the Spark cluster after all of this:

But it does work in local mode.

Then I tried to build Shark from source: Build Shark 0.9 (Master) for CDH 4.
And this Shark does not throw any error about failing to connect to the cluster.
But a new error comes out:

You can see that my jobs are dispatched to all nodes and then fail to execute simply because a class is not found.
This tells me that my assumption that Shark only needs to be deployed to one machine is totally wrong.
Just sync /opt/shark to all nodes and try again; everything works fine.



Install Spark/Shark on CDH 4 by @sskaje:
