In today’s data-driven world, understanding how to process massive datasets efficiently is vital. Hadoop, an open-source framework designed for distributed data storage and processing, is a cornerstone of big data analytics.
This blog walks you through setting up a single-node Hadoop cluster using Docker, making it easier to experiment and learn without the complexity of multi-node configurations.
The Objective of Learning Hadoop
The goal is to deploy a single-node Hadoop cluster using Docker. This is a sandbox environment where you can explore Hadoop’s features without needing a fully distributed setup or altering your primary system.
Why?
- Learn how to set up Hadoop in a controlled environment.
- Test Hadoop features (HDFS, YARN, MapReduce) in a single container.
- Why Docker? Docker provides isolation: the Hadoop cluster you set up won’t interfere with your main system. Additionally, Docker containers can be started and stopped easily, making this a flexible setup for experimentation.
In this exercise, you’re setting up a single-node Hadoop cluster within a Docker container. A Hadoop cluster usually consists of multiple nodes that work together to store and process large amounts of data across multiple machines.
In this case, you’re simulating this setup on a single node (one machine) to understand the components and how they interact without needing a full network of computers.
Prerequisites
Installing Docker on the Local Machine
Docker is the core technology used to run the Hadoop container. Verify its installation with:
docker --version
This ensures Docker is correctly set up on your machine.
If not installed, follow Docker’s installation instructions for your OS.
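If you want a quick sanity check beyond the version number, you can run Docker’s standard hello-world image (a general Docker check, not specific to Hadoop; it assumes internet access to pull the image):
# Pulls a tiny test image and prints a confirmation message if Docker is working
docker run hello-world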
Step 1: Pull the Hadoop Docker Image
Why?
Docker images are pre-configured environments. The bde2020/hadoop-namenode image includes everything needed to set up Hadoop quickly.
How to Pull the Image
To pull an image from Docker Hub, run this command, specifying the name and version of the image you want:
docker pull <image-name>:<version>
docker pull bde2020/hadoop-namenode:latest
This downloads the image from Docker Hub.
Verify the Download
Check that the image is available on your system:
docker images
You should see the bde2020/hadoop-namenode image listed.
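If you have many images on your machine, you can also filter the listing by repository name:
# Shows only the images belonging to this repository
docker images bde2020/hadoop-namenode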
Step 2: Start the Hadoop Container
Run the Container
The docker run command creates and starts a container from the pulled image. Use:
docker run -it --name hadoop-cluster -p 9870:9870 -p 8088:8088 -p 50070:50070 bde2020/hadoop-namenode:latest /bin/bash
If Port Conflicts Occur
If port conflicts occur because of a previously created container, remove the old container and start a new one:
docker rm <container-id>
Once it is removed, you will be able to run the container again smoothly.
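As a rough example, the cleanup sequence might look like this (container IDs and names will differ on your machine):
# List all containers, including stopped ones, to find the old one
docker ps -a
# Stop it if it is still running, then remove it (replace <container-id> with the real ID)
docker stop <container-id>
docker rm <container-id>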
Key options:
- `-it`: Allows you to interact with the container.
- `--name hadoop-cluster`: Assigns the container a name for easier management.
- `-p`: Maps container ports to host ports for accessing Hadoop services.
- `/bin/bash`: Starts the container with a Bash shell.
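To confirm the container is actually up, you can check from a separate terminal on the host (a general Docker check, not part of the original steps):
# Shows running containers; the hadoop-cluster entry should appear with its port mappings
docker ps --filter name=hadoop-cluster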
Step 3: Start the Hadoop Services
Now that the Hadoop container is up and running on our local machine, we can start all the services needed to continue the process.
Inside the container’s shell, start Hadoop:
start-all.sh
This single command starts everything installed within the container. It initializes:
- HDFS (Hadoop Distributed File System) for storage.
- YARN (Yet Another Resource Negotiator) for resource management.
Sometimes, however, you may get an error when trying to start all the services at once. If this happens, start the services you need one by one.
Manually Start Hadoop Services
You can start the individual Hadoop services using specific commands directly within the container:
Format the HDFS namenode (only needed the first time):
hdfs namenode -format
Start HDFS services manually:
hdfs namenode &
hdfs datanode &
Start YARN services manually:
yarn resourcemanager &
yarn nodemanager &
These commands start the necessary Hadoop daemons, though not as a fully distributed cluster, since everything runs on a single node in this case.
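One quick way to confirm the daemons are running, assuming the JDK’s jps tool is available inside the container, is:
# Lists running Java processes; you should see NameNode, DataNode, ResourceManager, and NodeManager
jps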
Step 4: Check the Status of Services
Once you start the services manually, you can check if they’re running by visiting the Hadoop web interfaces:
- HDFS NameNode UI: http://localhost:9870 (provides information about HDFS, such as storage usage and file replication)
- YARN ResourceManager UI: http://localhost:8088 (shows active jobs, their progress, and system resource utilization)
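If you prefer the command line over the web UIs, Hadoop also ships with status commands you can run inside the container:
# HDFS summary: capacity, live DataNodes, and replication details
hdfs dfsadmin -report
# YARN summary: the nodes registered with the ResourceManager
yarn node -list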
Step 5: Running a Sample MapReduce Job
Upload Sample Data to HDFS
Inside the container, prepare an input directory in HDFS:
hdfs dfs -mkdir -p /user/hadoop/input
Upload Hadoop’s default configuration XML files as test data:
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/input
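You can verify the upload before running the job:
# Lists the XML files that were just copied into HDFS
hdfs dfs -ls /user/hadoop/input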
Run the WordCount Job
Hadoop provides example programs. Run the WordCount job to count words in the uploaded files:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output
Check the Output
After the job is completed, display the results:
hdfs dfs -cat /user/hadoop/output/part-r-00000
This shows the word frequencies in the sample data.
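You can also list the output directory itself; a successful MapReduce job typically writes a _SUCCESS marker alongside the part files:
# Shows the job output files, e.g. _SUCCESS and part-r-00000
hdfs dfs -ls /user/hadoop/output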
Step 6: Stopping and Restarting the Container
This step is optional. You can stop a running container with the `docker stop` command without deleting it, and restart it again later using `docker start`.
Stop the Container (Use a separate terminal)
To stop the container without deleting it:
docker stop hadoop-cluster
Restart the Container
To restart later:
docker start -i hadoop-cluster
The `-i` flag allows you to interact with the container.
Customize HDFS Configurations
Important: Open a new terminal to continue the work.
Objective
Modify HDFS settings to observe their impact on performance. For example:
- Change the replication factor to reduce redundancy.
- Adjust the block size to optimize file processing.
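Both of those settings map to standard HDFS properties. As a rough illustration (the values below are only examples, and dfs.blocksize is an extra property not used later in this walkthrough; both conventionally live in hdfs-site.xml), the relevant entries look like this:
<property>
  <name>dfs.replication</name>
  <!-- number of copies kept for each block; 1 removes redundancy on a single node -->
  <value>1</value>
</property>
<property>
  <name>dfs.blocksize</name>
  <!-- HDFS block size; Hadoop 3 defaults to 128m -->
  <value>64m</value>
</property>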
Steps to Edit Configuration Files
1. Access the container:
docker exec -it hadoop-cluster /bin/bash
2. Navigate to the configuration files:
cd $HADOOP_HOME/etc/hadoop
3. List the files to confirm that the one we are going to edit is present:
ls
4. Open a new terminal on the host.
Copy the File from the Container to the Host
Use the docker cp command from the host terminal to copy the core-site.xml file:
docker cp hadoop-cluster:/opt/hadoop-3.2.1/etc/hadoop/core-site.xml ./core-site.xml
Edit the File on Your Host Machine
Open the file with any text editor on your local system, such as:
- Notepad++ (Windows)
- VS Code
- Nano/Vi (Linux or macOS)
Add the desired configurations (e.g., fs.defaultFS or the replication factor) to the file. For example:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Copy the Modified File Back to the Container
Once you’ve made the changes, copy the file back to the container:
docker cp ./core-site.xml hadoop-cluster:/opt/hadoop-3.2.1/etc/hadoop/core-site.xml
Confirm Changes in the Container
1. Re-enter the container:
docker exec -it hadoop-cluster /bin/bash
2. Verify the updated file:
cat /opt/hadoop-3.2.1/etc/hadoop/core-site.xml
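To double-check which value Hadoop will actually pick up from the configuration files, you can query it directly inside the container (note that daemons already running need to be restarted before they use the new settings):
# Prints the effective value of dfs.replication as read from the config files
hdfs getconf -confKey dfs.replication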
Commit All the Changes
As the final step, commit all the changes to a new GitHub repository.