Apache Spark is a free, open-source, general-purpose, distributed computing framework designed to deliver faster computational results. It provides APIs for streaming and graph processing in Java, Python, Scala, and R. Apache Spark is generally used in Hadoop clusters, but you can also install it in standalone mode.

In this tutorial, we will show you how to install Apache Spark framework on Debian 11.

Prerequisites

  • A server running Debian 11.
  • A root password is configured on the server.

Install Java

Apache Spark runs on the Java Virtual Machine, so Java must be installed on your system. If it is not already installed, you can install it together with curl using the following command:

apt-get install default-jdk curl -y

Once Java is installed, verify the Java version using the following command:

java --version

You should get the following output:

openjdk 11.0.12 2021-07-20
OpenJDK Runtime Environment (build 11.0.12 7-post-Debian-2)
OpenJDK 64-Bit Server VM (build 11.0.12 7-post-Debian-2, mixed mode, sharing)
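
Some Spark startup scripts also read the JAVA_HOME variable. As an optional step, assuming the default Debian OpenJDK layout, you can set it by resolving the JDK path from the java binary and adding the line to your ~/.bashrc:

export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))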

Install Apache Spark

At the time of writing this tutorial, the latest version of Apache Spark is 3.1.2. You can download it using the following command:

wget https://dlcdn.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
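
Optionally, you can verify the integrity of the downloaded archive. The SHA-512 checksum is published alongside the release; the URL below assumes the release has been moved to the Apache archive, so adjust it if needed:

wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz.sha512
sha512sum spark-3.1.2-bin-hadoop3.2.tgz

Compare the value printed by sha512sum against the one in the downloaded .sha512 file.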

Once the download is completed, extract the downloaded file with the following command:

tar -xvzf spark-3.1.2-bin-hadoop3.2.tgz

Next, move the extracted directory to /opt with the following command:

mv spark-3.1.2-bin-hadoop3.2/ /opt/spark

Next, edit the ~/.bashrc file and add the Spark path variable:

nano ~/.bashrc

Add the following lines:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and close the file, then activate the Spark environment variables using the following command:

source ~/.bashrc
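
You can confirm that the Spark binaries are now on your PATH by printing the version:

spark-submit --version

You should see the Spark 3.1.2 version banner in the output.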

Start Apache Spark

You can now run the following command to start the Spark master service:

start-master.sh

You should get the following output:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-debian11.out
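
If you need the master to bind to a specific address or use different ports, you can pass these values explicitly. The following sketch uses placeholder values, so adjust them for your environment:

start-master.sh --host your-server-ip --port 7077 --webui-port 8080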

By default, the Apache Spark master web UI listens on port 8080. You can verify it using the following command:

ss -tunelp | grep 8080

You will get the following output:

tcp   LISTEN 0      1                                    *:8080             *:*    users:(("java",pid=24356,fd=296)) ino:47523 sk:b cgroup:/user.slice/user-0.slice/session-1.scope v6only:0                                                                                                                                                                                                                                                                     

Next, start the Apache Spark worker process using the following command:

start-slave.sh spark://your-server-ip:7077
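
Replace your-server-ip with the IP address of your server. In Spark 3.1 and later, the same script is also available under the newer name start-worker.sh, and you can limit the resources the worker offers to the master; a sketch with assumed values:

start-worker.sh spark://your-server-ip:7077 -c 2 -m 2G

Here -c sets the number of cores and -m the amount of memory the worker advertises.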

Access the Apache Spark Web UI

You can now access the Apache Spark web interface using the URL http://your-server-ip:8080. You should see the Apache Spark master and worker services on the following screen:

<img alt="Apache Spark Dashboard" data-ezsrc="https://kirelos.com/wp-content/uploads/2021/10/echo/p1.png61695d36ae218.jpg" ezimgfmt="rs rscb5 src ng ngcb5" height="387" loading="lazy" src="data:image/svg xml,” width=”750″>

Click on the Worker id. You should see the detailed information of your Worker on the following screen:

<img alt="Spark Worker" data-ezsrc="https://kirelos.com/wp-content/uploads/2021/10/echo/p2.png61695d36e9134.jpg" ezimgfmt="rs rscb5 src ng ngcb5" height="254" loading="lazy" src="data:image/svg xml,” width=”750″>

Connect Apache Spark via Command-line

If you want to connect to Spark via its command shell, run the command below:

spark-shell

Once you are connected, you will get the following interface:

Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.12)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
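
Besides the interactive shell, you can submit a batch job to the standalone cluster with spark-submit. As a quick sanity check, you can run the bundled SparkPi example; the jar path below assumes the default Spark 3.1.2 layout:

spark-submit --class org.apache.spark.examples.SparkPi --master spark://your-server-ip:7077 $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.2.jar 100

The estimated value of Pi is printed near the end of the job output.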

If you want to use Python with Spark, you can use the pyspark command-line utility.

First, install Python 3 with the following command:

apt-get install python3 -y

Once installed, you can connect to Spark with the following command:

pyspark

Once connected, you should get the following output:

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.9.2 (default, Feb 28 2021 17:03:44)
Spark context Web UI available at http://debian11:4040
Spark context available as 'sc' (master = local[*], app id = local-1633769632964).
SparkSession available as 'spark'.
>>> 
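
Note that spark-shell and pyspark start with a local master by default (master = local[*]). To attach the shell to the standalone master you started earlier, pass the --master option:

pyspark --master spark://your-server-ip:7077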

Stop Master and Slave

First, stop the slave process using the following command:

stop-slave.sh

You will get the following output:

stopping org.apache.spark.deploy.worker.Worker

Next, stop the master process using the following command:

stop-master.sh

You will get the following output:

stopping org.apache.spark.deploy.master.Master
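
Alternatively, the stop-all.sh script stops the master and all workers in one step. Note that it connects over SSH to the hosts listed in the workers file (localhost by default), so SSH access to localhost may be required:

stop-all.sh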

Conclusion

Congratulations! You have successfully installed Apache Spark on Debian 11. You can now use Apache Spark in your organization to process large datasets.