Install Spark/Shark on CDH 4

CDH 4 is the current stable version of Cloudera's Distribution of Hadoop. http://cloudera.com/
Apache Spark is a fast and general engine for large-scale data processing. http://spark.incubator.apache.org/
Shark is a Hive-compatible query engine based on Spark. http://shark.cs.berkeley.edu/

Cloudera provides a parcel for Apache Spark. The official parcel is at http://archive.cloudera.com/spark/, and you can also get it from my Cloudera Mirror (only if you're on CentOS/RHEL 6 x86_64).

Environment

CentOS 6.4 x86_64, host names hadoop1-hadoop5.
Cloudera Manager 4.8.1
CDH 4.5.0

Install Spark

Cloudera already provides documentation on installing the Spark parcel, Installing Spark with Cloudera Manager, which applies only to CDH 4 + CM 4.

Install the Spark Parcel

Cloudera Manager Admin Console => Administration => Settings => Parcels.
If you're using the official repo, use http://archive.cloudera.com/spark/parcels/latest; if you're using my mirror, use http://cloudera.rst.im/spark/parcels/latest.
Save the settings, go to the Parcels page at the top right of CM, and install the parcel.

Configure Spark

I'm using my hadoop5 machine as the master, so I SSH in as root and edit /etc/spark/conf/spark-env.sh:
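A minimal sketch of the edits, assuming the parcel's spark-env.sh uses the usual standalone-mode variable names (they may differ in your parcel version):

    # hadoop5 is the master; 7077 matches the MASTER URL used later
    export STANDALONE_SPARK_MASTER_HOST=hadoop5.xxx.com
    export SPARK_MASTER_PORT=7077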

Leave others as default.
Then edit /etc/spark/conf/slaves like:
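Assuming the other four machines run workers, this is just one host name per line:

    hadoop1.xxx.com
    hadoop2.xxx.com
    hadoop3.xxx.com
    hadoop4.xxx.com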

Sync /etc/spark/conf/* to other nodes that have Spark deployed.
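For example (a sketch; assumes passwordless root SSH between the nodes):

    for h in hadoop1 hadoop2 hadoop3 hadoop4; do
        rsync -av /etc/spark/conf/ root@$h:/etc/spark/conf/
    done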

Start & Stop

Master
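Presumably via the standalone scripts shipped with Spark (a sketch; Spark 0.8 keeps these under bin/, Spark 0.9 under sbin/, so adjust the path for your parcel):

    # on hadoop5, the master
    /opt/cloudera/parcels/SPARK/lib/spark/bin/start-master.sh
    /opt/cloudera/parcels/SPARK/lib/spark/bin/stop-master.sh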

Slaves
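The same scripts exist for the workers; they read /etc/spark/conf/slaves and SSH out to each host:

    # on the master
    /opt/cloudera/parcels/SPARK/lib/spark/bin/start-slaves.sh
    /opt/cloudera/parcels/SPARK/lib/spark/bin/stop-slaves.sh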

All
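And to start or stop both at once (same bin/ vs sbin/ caveat):

    /opt/cloudera/parcels/SPARK/lib/spark/bin/start-all.sh
    /opt/cloudera/parcels/SPARK/lib/spark/bin/stop-all.sh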

Install Shark

I thought Shark was no more than a gateway, so I felt it unnecessary to install Shark on all servers, and installed it only on the one I was going to work on.
But this is wrong: see Trouble Shooting below.

CDH 4 + Shark 0.9.0 pre-release

The latest release of Shark is 0.9.0 from https://github.com/amplab/shark/releases; I tried shark-0.9.0-bin-hadoop2.tgz with hive-0.11.0-bin.tgz, but it is a pre-release and does not work with CDH 4.
If you're doing experiments like me, from the Shark command-line client you will get:

In /var/log/hadoop-hdfs/hadoop-cmf-hdfs1-NAMENODE-hadoop5.xxx.com.log.out, you can read errors like:

Shark 0.9.0 works only with Scala 2.10.3, which is included in Spark's parcel.
To get Shark closer to working, you have to create symbolic links for Scala's libraries:
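A sketch of the idea: link the Scala 2.10.3 jars bundled in the Spark parcel into Shark's lib directory (both paths are assumptions about your layout):

    # adjust paths: where the parcel keeps the Scala jars, and where Shark is extracted
    ln -s /opt/cloudera/parcels/SPARK/lib/spark/lib/scala-library.jar  /path/to/shark-0.9.0/lib/
    ln -s /opt/cloudera/parcels/SPARK/lib/spark/lib/scala-compiler.jar /path/to/shark-0.9.0/lib/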

And Shark 0.9.0 only works with hive-0.11.0-bin.tgz, so HIVE_HOME should point to your extracted hive-0.11.0-bin.tgz.
After these steps you get closer and closer, and then see the error message above.

Let’s build a working Shark on CDH 4.

CDH 4 + Shark 0.8.1

Download files and create folders:
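A sketch of these commands, matching the layout listed below; the archive names and URLs are assumptions (Shark 0.8.x pairs with AMPLab's patched Hive 0.9.0 and Scala 2.9.3):

    mkdir -p /opt/shark && cd /opt/shark
    # assumed archive names; check the actual release downloads
    wget https://github.com/amplab/shark/releases/download/v0.8.1/shark-0.8.1-bin-cdh4.tgz
    wget http://www.scala-lang.org/files/archive/scala-2.9.3.tgz
    tar zxf shark-0.8.1-bin-cdh4.tgz && ln -s /opt/shark/shark-0.8.1 /opt/shark/shark
    tar zxf scala-2.9.3.tgz
    mkdir -p /opt/shark/shark/dep
    ln -s /opt/shark/scala-2.9.3 /opt/shark/shark/dep/scala
    # Hive for Shark: AMPLab's patched hive-0.9.0 build, extracted and linked the same way
    ln -s /opt/shark/hive-0.9.0-bin /opt/shark/shark/dep/hive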

Now we get:

  • /opt/shark: root for all Sharks;
  • /opt/shark/shark: current working Shark;
  • /opt/shark/shark/dep: dependencies for Shark;
  • /opt/shark/shark/dep/hive: Hive for Shark;
  • /opt/shark/shark/dep/scala: Scala for Shark.

And some environment variables from CDH & Spark & Shark:

  • HADOOP_HOME: /opt/cloudera/parcels/CDH/lib/hadoop
  • SPARK_HOME: /opt/cloudera/parcels/SPARK/lib/spark
  • SHARK_HOME: /opt/shark/shark
  • SCALA_HOME: $SHARK_HOME/dep/scala
  • CDH_HIVE_HOME: /opt/cloudera/parcels/CDH/lib/hive
  • HIVE_HOME: $SHARK_HOME/dep/hive/
  • HIVE_CONF_DIR: ???
  • SPARK_LIBRARY_PATH: ???
  • MASTER: spark://hadoop5.xxx.com:7077

Configure $SHARK_HOME/conf/shark-env.sh with the variables above.
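Putting the values together, $SHARK_HOME/conf/shark-env.sh ends up looking roughly like this (the two ??? entries are resolved in the sections below):

    export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
    export SPARK_HOME=/opt/cloudera/parcels/SPARK/lib/spark
    export SHARK_HOME=/opt/shark/shark
    export SCALA_HOME=$SHARK_HOME/dep/scala
    export HIVE_HOME=$SHARK_HOME/dep/hive/
    export MASTER=spark://hadoop5.xxx.com:7077
    # HIVE_CONF_DIR and SPARK_LIBRARY_PATH: see below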

You must have noticed the ??? above; let's figure them out one by one.

Hive Configuration

In CM/CDH, Hive configurations are managed by CM and can be deployed from the web console to all nodes that have a Hive Gateway installed.
But Shark's official wiki says a bit more about setting it up on CDH.
If I set HIVE_CONF_DIR to /etc/hive/conf or $CDH_HIVE_HOME/conf (which links $CDH_HIVE_HOME/conf => /etc/hive/conf => /etc/alternatives/hive-conf => /etc/hive/conf.cloudera.hive1), this error comes out:

The solution is:
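Since HIVE_CONF_DIR ends up pointing at $SHARK_HOME/conf (see below), the fix is presumably to give Shark a private copy of the CM-managed Hive config:

    cp /etc/hive/conf/hive-site.xml $SHARK_HOME/conf/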

And add the following lines into Shark's hive-site.xml, according to the wiki:
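A sketch of what those lines presumably look like, given the nameservice note that follows (the property value is an assumption about my HA setup):

    <!-- point the default filesystem at the HDFS nameservice -->
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://nameservice1</value>
    </property>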

Change nameservice1 to your name service or name node.
Then set HIVE_CONF_DIR to $SHARK_HOME/conf.

lz4 Support

The data I imported is all compressed with lz4; when I tried to SELECT a row from a Hive table:

Take a look at /etc/spark/conf/spark-env.sh:
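Roughly, it sets the native library path (a sketch; the exact line may differ in your parcel):

    export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/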

which includes /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/.

Download the lz4 source code from https://code.google.com/p/lz4/, compile it to get liblz4.so, and copy that to all nodes that have Spark installed.
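A sketch of the build-and-copy step (the destination directory is an assumption, but it needs to be on the library path above):

    # in the unpacked lz4 source tree; depending on the release,
    # the shared library may need a separate target such as 'make lib'
    make
    for h in hadoop1 hadoop2 hadoop3 hadoop4 hadoop5; do
        scp liblz4.so root@$h:/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/
    done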

Add the SPARK_LIBRARY_PATH setting to $SHARK_HOME/conf/shark-env.sh:
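This resolves the second ??? from the variable list, presumably with the same native directory Spark itself uses:

    export SPARK_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/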

Restart Spark master and slaves.

Trouble Shooting

Unfortunately, Shark could not connect to the Spark cluster after all this:

But it does work in local mode.

Then I tried to build Shark from source: Build Shark 0.9 (master) for CDH 4.
That Shark does not throw any error about failing to connect to the cluster.
But a new error comes out:

You can see my jobs are dispatched to all nodes and then fail to execute just because of a class not found.
This tells me that my idea that Shark needs to be deployed to only one machine was totally wrong.
Just sync /opt/shark to all nodes and try again; everything works fine.

Done.

# EOF

Install Spark/Shark on CDH 4 by @sskaje: https://sskaje.me/2014/02/install-spark-shark-cdh-4/
