Setting up Apache Spark with Java on Windows
15 April 2018
Setting up Spark for use with Java on Windows is fairly easy if you know what to do. I will take you through the steps needed here.
We will use the following technologies, which you should already have installed and set up:
- Java 8
- Apache Maven
- IntelliJ IDEA (or another IDE set up to work with Maven)
You should know how to work with Maven.
My setup uses the D: drive, but you should be able to substitute C: if you prefer.
Installing Spark
First, you need to download and install Apache Spark.
Go to the Apache Spark downloads page and download the archive named spark-2.0.0-bin-hadoop2.7.tgz.
Extract the archive to D:\spark such that you now have the folders D:\spark\bin etcetera.
Now download Hadoop and copy its bin\winutils.exe into the D:\spark\bin folder; Spark needs winutils.exe to work with the local file system on Windows.
Environment variables
Go to your system's environment variables by typing "environment variables" in the Start menu and selecting "Edit the system environment variables". Add two new variables under the "user variables" section:
- HADOOP_HOME with the value D:\spark
- SPARK_HOME with the value D:\spark\bin
Now edit the PATH variable and add two new entries:
- %HADOOP_HOME%
- %SPARK_HOME%
Close all windows by clicking "OK".
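If you want a quick sanity check from Java before going any further, the following minimal sketch (plain System.getenv, no Spark involved, and entirely optional) prints the two variables so you can confirm that a newly started process sees them:

public class EnvCheck {
    public static void main(String[] args) {
        // Both values should be non-null: HADOOP_HOME -> D:\spark and
        // SPARK_HOME -> D:\spark\bin with the setup described above.
        System.out.println("HADOOP_HOME = " + System.getenv("HADOOP_HOME"));
        System.out.println("SPARK_HOME  = " + System.getenv("SPARK_HOME"));
    }
}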
Testing the installation
Open a command prompt (Windows+R, enter cmd and press the Return key) and execute spark-shell.cmd. Note that changes to environment variables only apply to newly opened command prompts, so use a fresh one. This should launch the Spark shell and among other things print the Spark logo as ASCII art.
Setting up a Maven project
Now we will create a Maven project so that we can use Spark from Java.
Create a new Maven project with the quickstart archetype maven-archetype-quickstart. In IntelliJ you can do this through
File > New > Project... and selecting Maven in the list, then checking "Create from archetype" and selecting the quickstart archetype.
Open pom.xml and add the following repository:
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
Also make sure the dependencies section looks like this (the JUnit dependency is already generated by the archetype; the two Spark dependencies are new):
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.0-cloudera1-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.0.0-cloudera1-SNAPSHOT</version>
    </dependency>
</dependencies>
Java
Now edit the App.java file that was created by the Maven archetype and enter this code below the package statement:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Counts the occurrences of each word in a text file using Spark.
 */
public class App {
    private static final Pattern SPACE = Pattern.compile(" ");
    private static final String INPUT_PATH = "src/main/resources/input.txt";

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java_word_count")
                .master("local[4]") // Replace 4 with the number of cores on your processor
                .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
                .getOrCreate();

        // Read the input file and split each line into words.
        JavaRDD<String> lines = spark.read().textFile(INPUT_PATH).javaRDD();
        JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());

        // Map each word to a (word, 1) pair and sum the counts per word.
        JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

        // Swap to (count, word) so we can sort by count, then swap back.
        JavaPairRDD<Integer, String> swapped = counts.mapToPair(Tuple2::swap);
        swapped = swapped.sortByKey(); // Use sortByKey(false) for descending order.
        List<Tuple2<String, Integer>> output = swapped.mapToPair(Tuple2::swap).collect();

        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }

        spark.stop();
    }
}
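The pom.xml above also pulls in spark-sql, which this RDD-based example only uses for SparkSession itself. If you would rather use the Dataset API, roughly the same word count can be written as in the sketch below; it assumes the same SparkSession and INPUT_PATH as above and goes inside main after getOrCreate():

// Additional imports needed at the top of App.java:
//   import org.apache.spark.sql.Dataset;
//   import org.apache.spark.sql.Row;
//   import static org.apache.spark.sql.functions.*;

// Split each line (the "value" column) into words, count occurrences per
// word and sort by the count column in descending order.
Dataset<String> lines = spark.read().textFile(INPUT_PATH);
Dataset<Row> wordCounts = lines
        .select(explode(split(col("value"), " ")).as("word"))
        .groupBy("word")
        .count()
        .orderBy(col("count").desc());
wordCounts.show(20, false);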
Now we just need an input text file. Andrej Karpathy's char-rnn repository on GitHub (an example of character-level recurrent neural networks) includes an input file with some Shakespeare plays. Download that text file from the repository and save it as input.txt in src\main\resources, matching the INPUT_PATH constant in the code.
You should now be able to run the main function in App.java and obtain a list of word counts after a lot of Spark output.
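Most of that extra output is Spark's INFO-level logging. If you want to quiet it down, one simple option (optional, and just one of several ways to configure logging) is to lower the log level right after the session is created:

// Place this directly after getOrCreate() in main() to reduce console noise.
spark.sparkContext().setLogLevel("WARN");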
Congratulations, you have set up Apache Spark for use with Java!