Pi cluster (master node Hadoop and Spark)
This post follows my previous post Pi cluster (SSH and static IP), and documents my setup of Hadoop and Spark on the master node of my cluster.
0. Installing the JDK
On the master node, install OpenJDK 8 with:
$ sudo apt install openjdk-8-jdk
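Before going further, it is worth confirming that the JDK is picked up on the PATH; the exact build string will vary, but the reported version should be 1.8:
$ java -version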
1. Installing Apache Hadoop
Download and install Hadoop on the master node with the following:
$ cd && wget https://apachemirror.sg.wuchna.com/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
$ sudo tar -xvf hadoop-3.2.1.tar.gz -C /opt/
$ rm hadoop-3.2.1.tar.gz && cd /opt
$ sudo mv hadoop-3.2.1 hadoop
Change the ownership of this directory:
$ sudo chown $USER: -R /opt/hadoop
Edit ~/.bashrc by appending the following:
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
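For the new variables to take effect, reload the file in the current shell (or open a new terminal):
$ source ~/.bashrc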
Edit /opt/hadoop/etc/hadoop/hadoop-env.sh by adding the following:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
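The path above is for an amd64 install of OpenJDK 8; if your JDK ended up somewhere else (for example on an ARM board), the actual location can be checked with:
$ readlink -f $(which java)
and JAVA_HOME should typically be set to everything before /jre/bin/java (or /bin/java).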
Verify that Hadoop has been installed correctly by checking the version:
$ cd && hadoop version | grep Hadoop
Hadoop 3.2.1
2. Installing Apache Spark
The process for Spark is very similar to the above. Install with the following commands:
$ wget https://apachemirror.sg.wuchna.com/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
$ sudo tar -xvf spark-3.0.0-bin-hadoop3.2.tgz -C /opt/
$ rm spark-3.0.0-bin-hadoop3.2.tgz && cd /opt/
$ sudo mv spark-3.0.0-bin-hadoop3.2 spark
$ sudo chown $USER: -R /opt/spark
Edit ~/.bashrc by appending the following:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
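As before, reload ~/.bashrc so that the spark-shell binary is found on the PATH:
$ source ~/.bashrc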
Check that Spark has been installed correctly with:
$ spark-shell --version
In my case I had the following warning messages due to the master node being connected to two networks:
WARN Utils: Your hostname, odyssey resolves to a loopback address: 127.0.1.1; using 10.42.0.1 instead (on interface enp2s0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
To resolve this, I created the Spark environment configuration file:
$ gedit /opt/spark/conf/spark-env.sh
and inserted the following:
#!/usr/bin/env bash
export SPARK_LOCAL_IP=10.42.0.1
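Here 10.42.0.1 is the cluster-facing address of my master node (the one reported in the warning above); substitute your own if it differs. With spark-env.sh in place, re-running the version check should no longer produce the loopback warning:
$ spark-shell --version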
3. Configuring the Hadoop Distributed File System (HDFS)
Set up HDFS by modifying some configuration files, all of which are within /opt/hadoop/etc/hadoop/.
The first is core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://odyssey:9000</value>
  </property>
</configuration>
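Note that odyssey is the hostname of my master node. A quick check that it resolves to the cluster-facing address rather than the loopback (assuming the hosts entries from the previous post are in place) is:
$ getent hosts odyssey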
The next is hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///opt/hadoop_tmp/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///opt/hadoop_tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Also, create the following directories and configure ownership:
$ sudo mkdir -p /opt/hadoop_tmp/hdfs/datanode
$ sudo mkdir -p /opt/hadoop_tmp/hdfs/namenode
$ sudo chown $USER: -R /opt/hadoop_tmp
The next file is mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
The final file is yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Format the HDFS with:
$ hdfs namenode -format -force
Start HDFS with:
$ start-dfs.sh && start-yarn.sh
Check that the HDFS and YARN daemons are running properly:
$ jps
3040 NameNode
3762 NodeManager
4037 Jps
3608 ResourceManager
3355 SecondaryNameNode
2687 DataNode
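The expected daemons are NameNode, SecondaryNameNode and DataNode for HDFS, and ResourceManager and NodeManager for YARN (the process IDs will of course differ). As a further smoke test, you can create and list a directory on HDFS; the directory name here is arbitrary:
$ hdfs dfs -mkdir /test
$ hdfs dfs -ls /
The NameNode and ResourceManager web UIs should also be reachable on ports 9870 and 8088 respectively (the Hadoop 3.x defaults).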
If this were just a single-node “cluster”, configuration would stop here. However, I still have four worker nodes to set up, and the next steps are at: Pi cluster (cluster Hadoop and Spark)