15 April 2018
Setting up Spark for use with Java on Windows is fairly easy once you know what to do. I will take you through the steps needed here.
We will use Java, Maven, and an IDE (I use IntelliJ IDEA), which you should already have installed and set up. You should know how to work with Maven. My setup uses the D: volume, but you should be able to substitute C: if you prefer.
First, you need to download and set up Apache Spark. Go to the Apache Spark downloads page and download the archive named spark-2.0.0-bin-hadoop2.7.tgz.
Extract the archive to D:\spark so that you now have folders such as D:\spark\bin.
Now download Hadoop and copy its bin\winutils.exe to the D:\spark\bin folder.
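For example, assuming you extracted Hadoop to D:\hadoop (adjust the path to wherever you actually put it), you can do the copy from a command prompt like this:
copy D:\hadoop\bin\winutils.exe D:\spark\bin\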
Go to your system's environment variables by typing "environment variables" in the Start menu and selecting "Edit the system environment variables". Add two new variables under the "user variables" section:
HADOOP_HOME with the value D:\spark
SPARK_HOME with the value D:\spark\bin
Now edit the PATH variable and add two new entries:
%HADOOP_HOME%
%SPARK_HOME%
Close all windows by clicking "OK".
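To check that the variables are picked up, you can open a new command prompt and run:
echo %HADOOP_HOME%
where winutils.exe
The first command should print D:\spark, and the second should find D:\spark\bin\winutils.exe. If it does not, double-check the PATH entries above.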
Open a command prompt (Windows+R, enter cmd and press the Return key) and execute spark-shell.cmd. This should launch the Spark shell and, among other things, print the Spark logo as ASCII art.
Now we will create a Maven project so that we can use Spark from Java.
Create a new Maven project with the quickstart archetype maven-archetype-quickstart. In IntelliJ you can do this through File > New > Project..., selecting Maven in the list, then checking "Create from archetype" and selecting the quickstart archetype.
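If you prefer the command line over an IDE, a roughly equivalent project can be generated with Maven directly; the groupId and artifactId below are just example values:
mvn archetype:generate -DgroupId=com.example -DartifactId=spark-word-count -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false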
Open pom.xml and add the following repository:
<repositories>
    <repository>
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>
Also add these dependencies (the JUnit one should already be there from the archetype; the two Spark dependencies are new):
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.0-cloudera1-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.0.0-cloudera1-SNAPSHOT</version>
    </dependency>
</dependencies>
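One note: the code in the next step uses Java 8 lambdas, and depending on the archetype version the generated pom.xml may still target an older Java release. If the project fails to compile, setting the compiler level to 1.8 should help, for example with these properties:
<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
</properties>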
Now edit the App.java file that was created by the Maven archetype and enter this code below the package statement:
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Counts the occurrences of each word in the input file and prints them sorted by count.
 */
public class App {
    private static final Pattern SPACE = Pattern.compile(" ");
    private static final String INPUT_PATH = "src/main/resources/input.txt";

    public static void main(String[] args) throws IOException {
        SparkSession spark = SparkSession
                .builder()
                .appName("Java_word_count")
                .master("local[4]") // Replace 4 with the number of cores on your processor
                .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
                .getOrCreate();

        // Read the input file and split each line into words
        JavaRDD<String> lines = spark.read().textFile(INPUT_PATH).javaRDD();
        JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());

        // Map each word to a (word, 1) pair and sum the counts per word
        JavaPairRDD<String, Integer> ones = words.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairRDD<String, Integer> counts = ones.reduceByKey((i1, i2) -> i1 + i2);

        // Swap to (count, word) so we can sort by count, then swap back before collecting
        JavaPairRDD<Integer, String> swapped = counts.mapToPair(Tuple2::swap);
        swapped = swapped.sortByKey();
        List<Tuple2<String, Integer>> output = swapped.mapToPair(Tuple2::swap).collect();

        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }

        spark.stop();
    }
}
Now we just need an input text file. Andrej Karpathy has an example of character-level recurrent neural networks on GitHub, and the repository contains an input file with some Shakespeare plays. Download the text file from the repository and save it as input.txt in src\main\resources, so it matches the INPUT_PATH in the code. (Any plain text file will work here.)
You should now be able to run the main function in App.java and, after a lot of Spark output, obtain a list of word counts.
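If you would rather run it outside the IDE, something like this should also work from the project folder, using the exec-maven-plugin (replace com.example.App with the actual package and class name the archetype generated for you):
mvn compile exec:java -Dexec.mainClass=com.example.App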
Congratulations, you have set up Apache Spark for use with Java!