Setting up Apache Spark with Java on Windows
15 April 2018
Setting up Spark for use with Java on Windows is fairly easy if you know what to do. I will take you through the steps needed here.
We will use the following technologies, which you should already have installed and set up:
- Java 8
- Apache Maven
- IntelliJ IDEA (or another IDE set up to work with Maven)
You should know how to work with Maven.
My setup uses the D: drive, but you should be able to substitute C: if you prefer.
Installing Spark
First, you need to download and install Apache Spark.
Go to the Apache Spark downloads page and download the archive named spark-2.0.0-bin-hadoop2.7.tgz.
Extract the archive to D:\spark such that you now have the folders D:\spark\bin etcetera.
Now download Hadoop and copy its bin\winutils.exe into the D:\spark\bin folder; Spark needs winutils.exe to work with the local file system on Windows.
Environment variables
Go to your system's environment variables by typing "environment variables" in the Start menu and selecting "Edit the system environment variables". Add two new variables under the "user variables" section:
- HADOOP_HOME with the value D:\spark
- SPARK_HOME with the value D:\spark\bin
Now edit the PATH variable and add two new entries:
- %HADOOP_HOME%
- %SPARK_HOME%
Close all windows by clicking "OK".
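If you want a quick sanity check from Java before going any further, the following minimal sketch (plain System.getenv, no Spark involved, and entirely optional) prints the two variables so you can confirm that a newly started process sees them:

public class EnvCheck {
    public static void main(String[] args) {
        // Both values should be non-null: HADOOP_HOME -> D:\spark and
        // SPARK_HOME -> D:\spark\bin with the setup described above.
        System.out.println("HADOOP_HOME = " + System.getenv("HADOOP_HOME"));
        System.out.println("SPARK_HOME  = " + System.getenv("SPARK_HOME"));
    }
}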
Testing the installation
Open a command prompt (Windows+R, enter cmd and press the Return key) and execute spark-shell.cmd. Note that changes to environment variables only apply to newly opened command prompts, so use a fresh one. This should launch the Spark shell and among other things print the Spark logo as ASCII art.
Setting up a Maven project
Now we will create a Maven project so that we can use Spark from Java.
Create a new Maven project with the quickstart archetype maven-archetype-quickstart. In IntelliJ you can do this through
File > New > Project... and selecting Maven in the list, then checking "Create from archetype" and selecting the quickstart archetype.
Open pom.xml and add the following repository:
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
Also make sure the dependencies section looks like this (the JUnit dependency is already generated by the archetype; the two Spark dependencies are new):
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.0-cloudera1-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.0.0-cloudera1-SNAPSHOT</version>
    </dependency>
</dependencies>
Java
Now edit the App.java file that was created by the Maven archetype and enter this code below the package statement:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Counts the occurrences of each word in a text file using Spark.
 */
public class App {
    private static final Pattern SPACE = Pattern.compile(" ");
    private static final String INPUT_PATH = "src/main/resources/input.txt";

    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java_word_count")
                .master("local[4]") // Replace 4 with the number of cores on your processor
                .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
                .getOrCreate();

        // Read the input file and split each line into words.
        JavaRDD<String> lines = spark.read().textFile(INPUT_PATH).javaRDD();
        JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());

        // Map each word to a (word, 1) pair and sum the counts per word.
        JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

        // Swap to (count, word) so we can sort by count, then swap back.
        JavaPairRDD<Integer, String> swapped = counts.mapToPair(Tuple2::swap);
        swapped = swapped.sortByKey(); // Use sortByKey(false) for descending order.
        List<Tuple2<String, Integer>> output = swapped.mapToPair(Tuple2::swap).collect();

        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }

        spark.stop();
    }
}
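The pom.xml above also pulls in spark-sql, which this RDD-based example only uses for SparkSession itself. If you would rather use the Dataset API, roughly the same word count can be written as in the sketch below; it assumes the same SparkSession and INPUT_PATH as above and goes inside main after getOrCreate():

// Additional imports needed at the top of App.java:
//   import org.apache.spark.sql.Dataset;
//   import org.apache.spark.sql.Row;
//   import static org.apache.spark.sql.functions.*;

// Split each line (the "value" column) into words, count occurrences per
// word and sort by the count column in descending order.
Dataset<String> lines = spark.read().textFile(INPUT_PATH);
Dataset<Row> wordCounts = lines
        .select(explode(split(col("value"), " ")).as("word"))
        .groupBy("word")
        .count()
        .orderBy(col("count").desc());
wordCounts.show(20, false);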
Now we just need an input text file. Andrej Karpathy's char-rnn repository on GitHub (an example of character-level recurrent neural networks) includes an input file with some Shakespeare plays. Download that text file from the repository and save it as input.txt in src\main\resources, matching the INPUT_PATH constant in the code.
You should now be able to run the main function in App.java and obtain a list of word counts after a lot of Spark output.
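Most of that extra output is Spark's INFO-level logging. If you want to quiet it down, one simple option (optional, and just one of several ways to configure logging) is to lower the log level right after the session is created:

// Place this directly after getOrCreate() in main() to reduce console noise.
spark.sparkContext().setLogLevel("WARN");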
Congratulations, you have set up Apache Spark for use with Java!