Create ODPi-BigTop-Hadoop-Config-Run.md

Added first content to ODPi BigTop Hadoop
setup and running section

Signed-off-by: Robert Wolff <robert.wolff@linaro.org>
Robert Wolff 2016-06-03 11:47:22 -07:00
This post concentrates on running Hadoop after [installing](ODPi-Hadoop-Installation.md) ODPi components built using Apache BigTop. These steps cover configuring and running Hadoop on a single node only.
# Add Hadoop User
We need to create a dedicated user (`hduser`) for running Hadoop. This user needs to be added to the `hadoop` group:
```shell
sudo useradd -m -G hadoop hduser   # -m creates the home directory, needed for the SSH keys below
```
Give `hduser` a password:
```shell
sudo passwd hduser
```
Add `hduser` to the sudoers list.
On Debian:
```shell
sudo adduser hduser sudo
```
On CentOS:
```shell
sudo usermod -aG wheel hduser   # -a appends; without it, usermod would drop the hadoop group
```
Switch to hduser:
```shell
sudo su - hduser
```
# Generate ssh key for hduser
```shell
ssh-keygen -t rsa -P ""
```
Press \<enter\> to accept the default file name.
Enable ssh access to local machine:
```shell
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
```
Test ssh setup, as hduser:
```shell
ssh localhost
```
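If `ssh localhost` still prompts for a password, a common culprit is file permissions: sshd ignores keys whose files are group- or world-writable. A quick fix, using the same paths as above:

```shell
# Tighten permissions so sshd will accept key-based login
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
touch "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```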
# Disabling IPv6
```shell
sudo nano /etc/sysctl.conf
```
Add the below lines to the end and save:
```shell
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
```
Prefer IPv4 on Hadoop:
```shell
sudo nano /etc/hadoop/conf/hadoop-env.sh
```
Uncomment the line below. Note that the Java property name is case-sensitive and should read `java.net.preferIPv4Stack`:
```shell
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
```
Run sysctl to apply the changes:
```shell
sudo sysctl -p
```
# Configuring the app environment
Create the Hadoop working directory and set its ownership and permissions:
```shell
sudo mkdir -p /app/hadoop/tmp
sudo chown hduser:hadoop /app/hadoop/tmp
sudo chmod 750 /app/hadoop/tmp
sudo chown hduser:hadoop /usr/lib/hadoop
sudo chmod 750 /usr/lib/hadoop
```
# Setting up Environment Variables
Switch to `hduser` and open its `.bashrc`:
```shell
sudo su - hduser
nano .bashrc
```
Add the following to the end and save:
```shell
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib/native"
export HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs
export YARN_HOME=/usr/lib/hadoop-yarn
export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn/
export CLASSPATH=$CLASSPATH:.
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/hadoop-common-2.6.0.jar
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/client/hadoop-hdfs-2.6.0.jar
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export PATH=/usr/lib/hadoop/libexec:/etc/hadoop/conf:$HADOOP_HOME/bin/:$PATH
```
Reload the shell environment (run `bash`), or simply log out and switch to `hduser` again.
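One line worth understanding is the `JAVA_HOME` export: `readlink -f` resolves `/usr/bin/java` through its symlink chain to the real binary, and `sed` strips the trailing `bin/java`, leaving the JDK root. A small illustration, using a hypothetical resolved path:

```shell
# Hypothetical resolved path, to show what the sed expression does
resolved=/usr/lib/jvm/java-8-openjdk-arm64/jre/bin/java
java_home=$(echo "$resolved" | sed "s:bin/java::")
echo "$java_home"   # prints /usr/lib/jvm/java-8-openjdk-arm64/jre/
```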
# Modifying config files
## core-site.xml
```shell
sudo nano /etc/hadoop/conf/core-site.xml
```
Look for the property with `<name>fs.default.name</name>` (the older alias of `fs.defaultFS`) and modify it as below:
```xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
```
Add this at the bottom, just before the closing \</configuration> tag:
```xml
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
```
## mapred-site.xml
```shell
sudo nano /etc/hadoop/conf/mapred-site.xml
```
Look for the property with `<name>mapred.job.tracker</name>` and modify it as below:
```xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
```
## hdfs-site.xml
```shell
sudo nano /etc/hadoop/conf/hdfs-site.xml
```
Modify the existing property as below:
```xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
```
# Format Namenode
This step is only needed the first time. Re-formatting an existing namenode destroys all content on HDFS.
```shell
sudo /etc/init.d/hadoop-hdfs-namenode init
```
# Start the YARN daemons
```shell
for i in hadoop-hdfs-namenode hadoop-hdfs-datanode ; do sudo service $i start ; done
sudo /etc/init.d/hadoop-yarn-resourcemanager start
sudo /etc/init.d/hadoop-yarn-nodemanager start
```
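The init scripts return before the daemons are fully up, so scripted setups often poll until a check succeeds before moving on. A minimal retry helper (the check command in the usage comment, including the NameNode web port, is an assumption and may differ per release):

```shell
# Retry a check command up to N times with a 1-second pause between tries;
# returns 0 as soon as the check succeeds, 1 if it never does.
wait_for() {
  local tries=$1; shift
  local i
  for i in $(seq "$tries"); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# Example (hypothetical): wait up to 30s for the NameNode web UI
# wait_for 30 curl -sf http://localhost:50070/ >/dev/null
```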
# Validating Hadoop
Check that Hadoop is running. The `jps` command should list the NameNode, DataNode, and YARN ResourceManager, or use `ps aux`:
```shell
sudo jps
```
or
```shell
ps aux | grep java
```
Alternatively, check if yarn managers are running:
```shell
sudo /etc/init.d/hadoop-yarn-resourcemanager status
sudo /etc/init.d/hadoop-yarn-nodemanager status
```
You should see output like below:
```shell
● hadoop-yarn-nodemanager.service - LSB: Hadoop nodemanager
Loaded: loaded (/etc/init.d/hadoop-yarn-nodemanager)
Active: active (running) since Tue 2015-12-22 18:25:03 UTC; 1h 24min ago
CGroup: /system.slice/hadoop-yarn-nodemanager.service
└─10366 /usr/lib/jvm/java-1.7.0-openjdk-arm64/bin/java -Dproc_node...
Dec 22 18:24:57 debian su[10348]: Successful su for yarn by root
Dec 22 18:24:57 debian su[10348]: + ??? root:yarn
Dec 22 18:24:57 debian su[10348]: pam_unix(su:session): session opened for ...0)
Dec 22 18:24:57 debian hadoop-yarn-nodemanager[10305]: starting nodemanager, ...
Dec 22 18:24:58 debian su[10348]: pam_unix(su:session): session closed for ...rn
Dec 22 18:25:03 debian hadoop-yarn-nodemanager[10305]: Started Hadoop nodeman...
```
## Run teragen, terasort and teravalidate ##
```shell
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 1000000 terainput
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort terainput teraoutput
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate -D mapred.reduce.tasks=8 teraoutput teravalidate
```
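TeraGen writes fixed 100-byte records, so the row count is what sizes the run; the 1000000 rows above come to roughly 100 MB, which is comfortable for a single node. Simple shell arithmetic shows the relationship:

```shell
# Each TeraGen record is 100 bytes, so dataset size = rows * 100
rows=1000000
bytes=$(( rows * 100 ))
echo "$(( bytes / 1000000 )) MB"   # prints 100 MB
```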
## Stop the Hadoop services ##
```shell
sudo /etc/init.d/hadoop-yarn-nodemanager stop
sudo /etc/init.d/hadoop-yarn-resourcemanager stop
for i in hadoop-hdfs-namenode hadoop-hdfs-datanode ; do sudo service $i stop; done
```
## Potential Errors / Issues and Resolutions ##
* If TeraGen, TeraSort, or TeraValidate errors out with a 'permission denied' exception, the following steps can be done:
```shell
sudo groupadd supergroup
sudo usermod -g supergroup hduser
```
* If for some reason you notice the config files (core-site.xml, hdfs-site.xml, etc.) are empty, you may have to delete all the packages and re-run the installation steps from scratch.
* Error while formatting the namenode with the following command:
```shell
sudo /etc/init.d/hadoop-hdfs-namenode init
```
If you see an error like the following:
```shell
WARN net.DNS: Unable to determine local hostname -falling back to "localhost"
java.net.UnknownHostException: centos: centos
at java.net.InetAddress.getLocalHost(InetAddress.java:1496)
at org.apache.hadoop.net.DNS.resolveLocalHostname(DNS.java:264)
at org.apache.hadoop.net.DNS.<clinit>(DNS.java:57)
```
something is wrong in the network setup. Check the /etc/hosts file:
```shell
sudo nano /etc/hosts
```
The hosts file should look like below:
```shell
127.0.0.1 <hostname> localhost localhost.localdomain # <hostname> should be the output of $ hostname
::1 localhost
```
Also try the following steps:
```shell
sudo rm -Rf /app/hadoop/tmp
hadoop namenode -format
```
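A quick way to confirm whether the machine's own hostname resolves (the root cause of the `UnknownHostException` above) is `getent`, which consults the same resolver configuration the JVM ends up using:

```shell
# Check that the local hostname resolves through /etc/hosts or DNS
h=$(hostname)
if getent hosts "$h" >/dev/null; then
  echo "$h resolves"
else
  echo "$h does not resolve; add it to /etc/hosts"
fi
```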