In today’s data-driven world, understanding how to process massive datasets efficiently is vital. Hadoop, an open-source framework designed for distributed data storage and processing, is a cornerstone of big data analytics.
This blog walks you through setting up a single-node Hadoop cluster using Docker, making it easier to experiment and learn without the complexity of multi-node configurations.
The Objective of Learning Hadoop
The goal is to deploy a single-node Hadoop cluster using Docker. This is a sandbox environment where you can explore Hadoop’s features without needing a fully distributed setup or altering your primary system.
Why?
- Learn how to set up Hadoop in a controlled environment.
- Test Hadoop features (HDFS, YARN, MapReduce) in a single container.
- Why Docker? Docker provides isolation: the Hadoop cluster you set up won’t interfere with your main system. Additionally, Docker containers can be started and stopped easily, making this a flexible setup for experimentation.
In this exercise, you’re setting up a single-node Hadoop cluster within a Docker container. A Hadoop cluster usually consists of multiple nodes that work together to store and process large amounts of data across multiple machines.
In this case, you’re simulating this setup on a single node (one machine) to understand the components and how they interact without needing a full network of computers.
Prerequisites
Installing Docker on the Local Machine
Docker is the core technology used to run the Hadoop container. Verify its installation with:
docker --version
This ensures Docker is correctly set up on your machine.
If not installed, follow Docker’s installation instructions for your OS.
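If you want a quick sanity check beyond the version number, you can run Docker’s standard hello-world image (a general Docker check, not specific to Hadoop; it assumes internet access to pull the image):
# Pulls a tiny test image and prints a confirmation message if Docker is working
docker run hello-world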
Step 1: Pull the Hadoop Docker Image
Why?
Docker images are pre-configured environments. The bde2020/hadoop-namenode image includes everything needed to set up Hadoop quickly.
How to Pull the Image
To pull an image from Docker Hub, run this command, specifying the name and version of the image you want:
docker pull <image-name>:<version>
docker pull bde2020/hadoop-namenode:latest
This downloads the image from Docker Hub.
Verify the Download
Check that the image is available on your system:
docker images
You should see the bde2020/hadoop-namenode image listed.
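If you have many images on your machine, you can also filter the listing by repository name:
# Shows only the images belonging to this repository
docker images bde2020/hadoop-namenode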
Step 2: Start the Hadoop Container
Run the Container
The docker run command creates and starts a container from the pulled image. Use:
docker run -it --name hadoop-cluster -p 9870:9870 -p 8088:8088 -p 50070:50070 bde2020/hadoop-namenode:latest /bin/bash
If Port Conflicts Occur
If port conflicts occur because of a previously created container, remove the old container and start a new one:
docker rm <container-id>
Once it is removed, you will be able to run the container again smoothly.
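As a rough example, the cleanup sequence might look like this (container IDs and names will differ on your machine):
# List all containers, including stopped ones, to find the old one
docker ps -a
# Stop it if it is still running, then remove it (replace <container-id> with the real ID)
docker stop <container-id>
docker rm <container-id>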
Key options:
- `-it`: Allows you to interact with the container.
- `--name hadoop-cluster`: Assigns the container a name for easier management.
- `-p`: Maps container ports to host ports for accessing Hadoop services.
- `/bin/bash`: Starts the container with a Bash shell.
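To confirm the container is actually up, you can check from a separate terminal on the host (a general Docker check, not part of the original steps):
# Shows running containers; the hadoop-cluster entry should appear with its port mappings
docker ps --filter name=hadoop-cluster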
Step 3: Start the Hadoop Services
Now that the Hadoop container is up and running on our local machine, we can start all the services needed to continue the process.
Inside the container’s shell, start Hadoop:
start-all.sh
This single command starts everything installed within the container. It initializes:
- HDFS (Hadoop Distributed File System) for storage.
- YARN (Yet Another Resource Negotiator) for resource management.
Sometimes, however, you may get an error when trying to start all the services at once. If this happens, start the services you need one by one.
Manually Start Hadoop Services
You can start the individual Hadoop services using specific commands directly within the container:
Format the HDFS namenode (only needed the first time):
hdfs namenode -format
Start HDFS services manually:
hdfs namenode &
hdfs datanode &
Start YARN services manually:
yarn resourcemanager &
yarn nodemanager &
These commands start the necessary Hadoop daemons, though not as a fully distributed cluster, since everything runs on a single node in this case.
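One quick way to confirm the daemons are running, assuming the JDK’s jps tool is available inside the container, is:
# Lists running Java processes; you should see NameNode, DataNode, ResourceManager, and NodeManager
jps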
Step 4: Check the Status of Services
Once you start the services manually, you can check if they’re running by visiting the Hadoop web interfaces:
- HDFS NameNode UI: http://localhost:9870 (provides information about HDFS, such as storage usage and file replication)
- YARN ResourceManager UI: http://localhost:8088 (shows active jobs, their progress, and system resource utilization)
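If you prefer the command line over the web UIs, Hadoop also ships with status commands you can run inside the container:
# HDFS summary: capacity, live DataNodes, and replication details
hdfs dfsadmin -report
# YARN summary: the nodes registered with the ResourceManager
yarn node -list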
Step 5: Running a Sample MapReduce Job
Upload Sample Data to HDFS
Inside the container, prepare an input directory in HDFS:
hdfs dfs -mkdir -p /user/hadoop/input
Upload Hadoop’s default configuration XML files as test data:
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /user/hadoop/input
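You can verify the upload before running the job:
# Lists the XML files that were just copied into HDFS
hdfs dfs -ls /user/hadoop/input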
Run the WordCount Job
Hadoop provides example programs. Run the WordCount job to count words in the uploaded files:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/hadoop/input /user/hadoop/output
Check the Output
After the job is completed, display the results:
hdfs dfs -cat /user/hadoop/output/part-r-00000
This shows the word frequencies in the sample data.
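You can also list the output directory itself; a successful MapReduce job typically writes a _SUCCESS marker alongside the part files:
# Shows the job output files, e.g. _SUCCESS and part-r-00000
hdfs dfs -ls /user/hadoop/output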
Step 6: Stopping and Restarting the Container
This step is optional. You can stop a running container with the `docker stop` command without deleting it, and restart it again later using `docker start`.
Stop the Container (Use a separate terminal)
To stop the container without deleting it:
docker stop hadoop-cluster
Restart the Container
To restart later:
docker start -i hadoop-cluster
The `-i` flag allows you to interact with the container.
Customize HDFS Configurations
Important: Open a new terminal to continue the work.
Objective
Modify HDFS settings to observe their impact on performance. For example:
- Change the replication factor to reduce redundancy.
- Adjust the block size to optimize file processing.
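Both of those settings map to standard HDFS properties. As a rough illustration (the values below are only examples, and dfs.blocksize is an extra property not used later in this walkthrough; both conventionally live in hdfs-site.xml), the relevant entries look like this:
<property>
  <name>dfs.replication</name>
  <!-- number of copies kept for each block; 1 removes redundancy on a single node -->
  <value>1</value>
</property>
<property>
  <name>dfs.blocksize</name>
  <!-- HDFS block size; Hadoop 3 defaults to 128m -->
  <value>64m</value>
</property>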
Steps to Edit Configuration Files
1. Access the container:
docker exec -it hadoop-cluster /bin/bash
2. Navigate to the configuration files:
cd $HADOOP_HOME/etc/hadoop
3. List the files to confirm that the one we are going to edit is present:
ls
4. Open a new terminal on the host.
Copy the File from the Container to the Host
Use the docker cp command from the host terminal to copy the core-site.xml file:
docker cp hadoop-cluster:/opt/hadoop-3.2.1/etc/hadoop/core-site.xml ./core-site.xml
Edit the File on Your Host Machine
Open the file with any text editor on your local system, such as:
- Notepad++ (Windows)
- VS Code
- Nano/Vi (Linux or macOS)
Add the desired configurations (e.g., fs.defaultFS or the replication factor) to the file. For example:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Copy the Modified File Back to the Container
Once you’ve made the changes, copy the file back to the container:
docker cp ./core-site.xml hadoop-cluster:/opt/hadoop-3.2.1/etc/hadoop/core-site.xml
Confirm Changes in the Container
1. Re-enter the container:
docker exec -it hadoop-cluster /bin/bash
2. Verify the updated file:
cat /opt/hadoop-3.2.1/etc/hadoop/core-site.xml
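To double-check which value Hadoop will actually pick up from the configuration files, you can query it directly inside the container (note that daemons already running need to be restarted before they use the new settings):
# Prints the effective value of dfs.replication as read from the config files
hdfs getconf -confKey dfs.replication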
Commit All the Changes
As the final step, commit all the changes to a new GitHub repository.