Apache Spark is a free and open-source cluster-computing framework used for analytics, machine learning and graph processing on large volumes of data. Spark comes with 80 high-level operators that enable you to build parallel apps and use it interactively from the Scala, Python, R, and SQL shells. It is a lightning-fast, in-memory data processing engine specially designed for data science. It provides a rich set of features including, Speed, Fault tolerance, Real-time stream processing, In-memory computing, Advance analytics and many more.

In this tutorial, we will show you how to install Apache Spark on Debian 10 server.

Prerequisites

  • A server running Debian 10 with 2 GB of RAM.
  • A root password is configured on your server.

Getting Started

Before starting, it is recommended to update your server with the latest version. You can update it using the following command:

apt-get update -y

 apt-get upgrade -y

Once your server is updated, restart it to implement the changes.

Install Java

Apache Spark is written in the Java language. So you will need to install Java in your system. By default, the latest version of Java is available in the Debian 10 default repository. You can install it using the following command:

apt-get install default-jdk -y

After installing Java, verify the installed version of Java using the following command:

java --version

You should get the following output:

openjdk 11.0.5 2019-10-15
OpenJDK Runtime Environment (build 11.0.5 10-post-Debian-1deb10u1)
OpenJDK 64-Bit Server VM (build 11.0.5 10-post-Debian-1deb10u1, mixed mode, sharing)

Download Apache Spark

First, you will need to download the latest version of the Apache Spark from its official website. At the time of writing this article, the latest version of Apache Spark is 3.0. You can download it to the /opt directory with the following command:

cd /opt

 wget http://apachemirror.wuchna.com/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz

Once the download is completed, extract the downloaded file using the following command:

tar -xvzf spark-3.0.0-preview2-bin-hadoop2.7.tgz

Next, rename the extracted directory to spark as shown below:

mv spark-3.0.0-preview2-bin-hadoop2.7 spark

Next, you will need to set the environment for Spark. You can do it by editing ~/.bashrc file:

nano ~/.bashrc

Add the following lines at the end of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and close the file when you are finished. Then, activate the environment with the following command:

source ~/.bashrc

Start the Master Server

You can now start the Master server using the following command:

start-master.sh

You should get the following output:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-debian10.out

By default, Apache Spark listening on the port 8080. You can verify it with the following command:

netstat -ant | grep 8080

Output:

tcp6       0      0 :::8080                 :::*                    LISTEN

Now, open your web browser and type the URL http://server-ip-address:8080. You should see the following page:

Please note down the Spark URL “spark://debian10:7077” from the above image. This will be used to start Spark worker process.

Start Spark Worker Process

Now, you can start the Spark worker process with the following command:

start-slave.sh spark://debian10:7077

You should get the following output:

starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-debian10.out

Access Spark Shell

Spark Shell is an interactive environment that provides a simple way to learn the API and analyze data interactively. You can access the Spark shell with the following command:

spark-shell

You should see the following output:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.0-preview2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
19/12/29 15:53:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://debian10:4040
Spark context available as 'sc' (master = local[*], app id = local-1577634806690).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _ / _ / _ `/ __/  '_/
   /___/ .__/_,_/_/ /_/_   version 3.0.0-preview2
      /_/
         
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.5)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

From here, you can learn how to make the most out of Apache Spark quickly and conveniently.

If you want to stop Spark Master and Slave server, run the following commands:

stop-slave.sh

 stop-master.sh

Thats it for now, you have successfully installed Apache Spark on Debian 10 server. For more information, you can refer the Spark official documentation at Spark Doc.