Installing Hadoop

Here we set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS). Installation is easier on Linux platforms than on Windows: Windows requires either building Hadoop from source or using a pre-compiled Windows binary. Version 2.7.1 is a stable release at the time of writing; pre-compiled Windows binaries, along with build instructions, are available here: https://github.com/karthikj1/Hadoop-2.7.1-Windows-64-binaries

Installation

Apache Hadoop 2.7.1 for Windows 64-bit platform

  • Download the precompiled binary for Windows

https://github.com/karthikj1/Hadoop-2.7.1-Windows-64-binaries

  • It is necessary to modify some configuration files inside \hadoop-2.7.1\etc\hadoop. All of these files are XML, and the updates go inside the top-level <configuration> node. Specifically:
    •  yarn-site.xml:
      <configuration>
      <!-- Site specific YARN configuration properties -->
        <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce_shuffle</value>
        </property>
      </configuration>
    •  core-site.xml:
      <configuration>
          <property>
           <name>fs.defaultFS</name>
           <value>hdfs://localhost:9000</value>
          </property>
      </configuration>
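      With fs.defaultFS set this way, bare HDFS paths resolve against hdfs://localhost:9000, so once the daemons are started (see below) a quick sanity check is that these two commands list the same directory:

      bin\hdfs dfs -ls /
      bin\hdfs dfs -ls hdfs://localhost:9000/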

       

    •  mapred-site.xml (create mapred-site.xml from mapred-site.xml.template if it does not exist):
      <configuration>
        <property>
          <name>mapreduce.framework.name</name>
          <value>yarn</value>
        </property>
      </configuration>

    •  hdfs-site.xml: Create the folders below first; otherwise Hadoop will fall back to its default location under \tmp.
      <configuration>
          <property>
              <name>dfs.replication</name>
              <value>1</value>
          </property>
          <property>
              <name>dfs.namenode.name.dir</name>
              <value>file:/hadoop-2.7.1/data/namenode</value>
          </property>
          <property>
              <name>dfs.datanode.data.dir</name>
              <value>file:/hadoop-2.7.1/data/datanode</value>
          </property>
      </configuration>
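      For example, assuming the distribution is unpacked at c:\hadoop-2.7.1 (the HADOOP_PREFIX configured below), the folders can be created up front with:

      mkdir c:\hadoop-2.7.1\data\namenode
      mkdir c:\hadoop-2.7.1\data\datanode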
  • Edit the file \hadoop-2.7.1\etc\hadoop\hadoop-env.cmd to add the following lines near the end of the file.

    @rem set JAVA_HOME=%JAVA_HOME%
    set JAVA_HOME=C:\Java\jdk1.8.0_102
    
    @rem A string representing this instance of hadoop. %USERNAME% by default.
    set HADOOP_IDENT_STRING=%USERNAME%
    set HADOOP_PREFIX=c:\hadoop-2.7.1
    set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
    set YARN_CONF_DIR=%HADOOP_CONF_DIR%
    set PATH=%PATH%;%HADOOP_PREFIX%\bin
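    Note that JAVA_HOME must point at your actual JDK install; C:\Java\jdk1.8.0_102 above is only an example, and paths containing spaces (such as C:\Program Files\Java\...) are known to cause problems for the Hadoop scripts. As a quick sanity check, you can run hadoop-env.cmd in a fresh command prompt and confirm that the JDK resolves:

    C:\hadoop-2.7.1>etc\hadoop\hadoop-env.cmd
    C:\hadoop-2.7.1>"%JAVA_HOME%\bin\java" -version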

     

  •  Format the filesystem with the following command:
    \hadoop-2.7.1\bin\hdfs namenode -format

     

  • Start HDFS Daemons – Run the following command to start the NameNode and DataNode on localhost.
    \hadoop-2.7.1\sbin\start-dfs.cmd

     

  •  To verify that the HDFS daemons are running, try copying a file to HDFS:
    C:\hadoop-2.7.1>bin\hdfs dfs -put README.txt /
    
    C:\hadoop-2.7.1>bin\hdfs dfs -ls /
    Found 1 items
    -rw-r--r--   1 dev supergroup       1366 2016-11-27 18:50 /README.txt
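    You can also read the file back, confirming the round trip through HDFS:

    C:\hadoop-2.7.1>bin\hdfs dfs -cat /README.txt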

     

  • Start YARN Daemons and run a YARN job –
    \hadoop-2.7.1\sbin\start-yarn.cmd
    
    The cluster should be up and running! To verify, we can run one of the bundled example jobs, a grep over the XML configuration files, on input copied to HDFS.

    C:\hadoop-2.7.1>bin\hdfs dfs -mkdir -p /user/%USERNAME%/input
    C:\hadoop-2.7.1>bin\hdfs dfs -put etc\hadoop\*.xml input
    C:\hadoop-2.7.1>bin\yarn jar share\hadoop\mapreduce\hadoop-mapreduce-examples-2.7.1.jar grep input output "dfs[a-z.]+"
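    When the job finishes, the matches it found can be read straight out of HDFS:

    C:\hadoop-2.7.1>bin\hdfs dfs -cat output/*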

     

  • Verify the ResourceManager GUI address – http://localhost:8088

    NameNode GUI address – http://localhost:50070

 

Apache Hadoop 2.7.1 for GNU/Linux

  • Download the binary for Linux

http://www.apache.org/dyn/closer.cgi/hadoop/common/

  • Adding a dedicated Hadoop system user

    We will use a dedicated Hadoop user account for running Hadoop. While that’s not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).

    $ sudo addgroup hadoop
    $ sudo adduser --ingroup hadoop hduser

    This will add the user hduser and the group hadoop to your local machine.
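    If you want to double-check the result, the standard id utility shows the new account and its group membership:

    $ id hduser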

  • Configuring SSH

    Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

    I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several online guides available.

    First, we have to generate an SSH key for the hduser user.

    user@ubuntu:~$ su - hduser
    hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
    Created directory '/home/hduser/.ssh'.
    Your identification has been saved in /home/hduser/.ssh/id_rsa.
    Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
    The key fingerprint is:
    9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
    The key's randomart image is:
    [...snipp...]
    hduser@ubuntu:~$

    The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).

    Second, you have to enable SSH access to your local machine with this newly created key.

    hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
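    Depending on your distribution’s defaults, you may also need to tighten the file’s permissions, since sshd ignores authorized_keys files that are too permissive:

    hduser@ubuntu:~$ chmod 0600 $HOME/.ssh/authorized_keys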

    The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).

    hduser@ubuntu:~$ ssh localhost
    The authenticity of host 'localhost (::1)' can't be established.
    RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
    Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
    Ubuntu 10.04 LTS
    [...snipp...]
    hduser@ubuntu:~$

    If the SSH connection fails, these general tips might help:

    • Enable debugging with ssh -vvv localhost and investigate the error in detail.
    • Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.

From here on, follow the official instructions:

Apache Hadoop 3.0.0-alpha1 – Hadoop: Setting up a Single Node Cluster.

Source: hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html
