Spark Machine Learning Example

Spark Machine Learning Application

We’ll build a machine learning application that uses the collaborative filtering method to predict which movies to recommend to a user, based on other users’ ratings of different movies.

Our recommendation engine solution will use the Alternating Least Squares (ALS) machine learning algorithm.

Even though the data sets used in this article’s code example are small in size and complexity, Spark MLlib can be used for real-world problems that are complex in nature, deal with data containing many dimensions, and involve very complex predictor functions. Remember that machine learning can be used to solve problems that cannot be solved by numerical means alone.

To keep the machine learning application simple so we can focus on the Spark MLlib API, we’ll follow the Movie Recommendations example discussed in a Spark Summit workshop. That exercise is a very good resource for learning more about the Spark MLlib library; for more details on the application, visit their website.

Use Case

The use case we want to implement with a Spark-based machine learning solution is a recommendation engine.

Recommendation engines are used to make predictions for unknown user-item associations (e.g. movie ratings) based on known user-item associations. They can make predictions based on a user’s affinity to other items and other users’ affinity to this specific item. The engine builds a prediction model from the known data (called training data) and then makes predictions for unknown user-item associations (called test data).

Our program includes the following steps to arrive at the top movie recommendations for the user (a minimal code sketch follows the list):

  • Load movies data file
  • Load the data file with ratings provided by a specific user (you)
  • Load the ratings data provided by other users (community)
  • Join the user ratings data with community ratings into a single RDD
  • Train the model with the ALS algorithm using the ratings data
  • Identify the movies not rated by a particular user (userId = 1)
  • Predict the ratings of the items not rated by user
  • Get top N recommendations (N=5 in our example)
  • Display the recommendation data on the console
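
Here is a minimal sketch of these steps in Java against the Spark 1.x MLlib API. The file paths, the ALS parameters (rank, number of iterations, lambda), and the user id are illustrative assumptions, and for brevity the sketch uses the model’s recommendProducts() method instead of manually filtering out the movies the user has already rated:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class MovieRecommendationSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Movie Recommendation Sample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Join the user's ratings with the community ratings into a single RDD,
        // skipping the CSV header lines (the file paths are placeholders)
        JavaRDD<String> lines = sc.textFile("data/user-ratings.csv")
                .union(sc.textFile("data/ratings.csv"))
                .filter(line -> !line.startsWith("userId"));

        // Parse each line (userId,movieId,rating,timestamp) into an MLlib Rating
        JavaRDD<Rating> ratings = lines.map(line -> {
            String[] fields = line.split(",");
            return new Rating(Integer.parseInt(fields[0]),    // userId
                              Integer.parseInt(fields[1]),    // movieId
                              Double.parseDouble(fields[2])); // rating
        });

        // Train the matrix factorization model with ALS
        // (rank = 10, 10 iterations, lambda = 0.01 are example values)
        MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01);

        // Get the top 5 recommendations for the user in context
        // (set this to the user id used in your user ratings file)
        int userId = 0;
        for (Rating r : model.recommendProducts(userId, 5)) {
            System.out.println("movieId=" + r.product() + ", predicted rating=" + r.rating());
        }

        sc.stop();
    }
}

When submitted with spark-submit (shown later in this section), the sketch prints the top five movie ids with their predicted ratings to the console; to display titles instead, look up each movieId in the movies.csv data loaded as a separate RDD.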

If you want to process the output data further or run additional analysis on it, you can store the results in a NoSQL database like Cassandra or MongoDB.

Data Sets

We will use the movie datasets provided by MovieLens. There are a few different data files we need for the sample application; these datasets are available for download from the GroupLens website. We will use one of the latest datasets (the smaller version with 100K ratings). Download the dataset zip file from the website.

The following table shows the different datasets used in the application.

#  Dataset                 File Name         Description                 Data Fields
1  Movies Data             movies.csv        Movie details               movieId,title,genres
2  User Ratings Data       user-ratings.csv  Ratings by a specific user  userId,movieId,rating,timestamp
3  Community Ratings Data  ratings.csv       Ratings by other users      userId,movieId,rating,timestamp

The user ratings file contains the ratings for the user in context. You can update the ratings in this file based on your own movie preferences. We’ll assign a user id of 0 (“User Id 0”) to represent these ratings.
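
For illustration, the first few lines of a user-ratings.csv file for this user might look like the following (the movie ids, rating values, and timestamps are made-up examples; the columns match the data fields listed in the table above):

userId,movieId,rating,timestamp
0,1,4.0,1470000000
0,16,5.0,1470000010
0,260,3.5,1470000020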

When we run the recommendation engine program, we’ll join the specific user ratings data with ratings from the community (other users).

Technologies

We will use Java to write the Spark MLlib program, which can be executed as a stand-alone application. The program uses the following MLlib Java classes, all located in the org.apache.spark.mllib.recommendation package (the corresponding import statements appear after the list):

  • ALS
  • MatrixFactorizationModel
  • Rating
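
For reference, here are the corresponding import statements, with a note on each class’s role:

import org.apache.spark.mllib.recommendation.ALS;                      // trains the recommendation model
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel; // the trained model, used to make predictions
import org.apache.spark.mllib.recommendation.Rating;                   // represents a (user, product, rating) tuple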

We will use Spark MLlib together with the JDK and Maven in the sample application to illustrate how the Spark MLlib API is used for performing predictive analytics.

The architecture components of the sample application are shown in Figure 4 below.

Figure 4. Spark Machine Learning Sample Application Architecture

There are several implementations of the movie recommendation example available in different languages supported by Spark, such as Scala (Databricks and MapR), Java (Spark Examples and a Java-based Recommendation Engine), and Python. We’ll use the Java solution in our sample application. Download the Java program from the Spark examples website to run the example on your local machine. Create a new Maven-based Java project called spark-mllib-sample-app and copy the Java class into the project. Modify the Java class to pass in the data sets discussed in the previous section.

Make sure you include the required Spark Java libraries in the dependencies section of the Maven pom.xml file.
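
Here is a sketch of the MLlib dependency entry, assuming a Spark 1.x build for Scala 2.10; adjust the artifact suffix and version to match your Spark installation (spark-mllib pulls in spark-core transitively):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>1.6.1</version>
</dependency>

To do a clean build and download the Spark library JAR files, you can run the following commands.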

First, set the environment variables for the JDK (JAVA_HOME), Maven (MAVEN_HOME), and Spark (SPARK_HOME):

For Windows Operating System:

set JAVA_HOME=[JDK_INSTALL_DIRECTORY]
set PATH=%PATH%;%JAVA_HOME%\bin

set MAVEN_HOME=[MAVEN_INSTALL_DIRECTORY]
set PATH=%PATH%;%MAVEN_HOME%\bin

set SPARK_HOME=[SPARK_INSTALL_DIRECTORY]
set PATH=%PATH%;%SPARK_HOME%\bin

cd c:\dev\projects\spark-mllib-sample-app

mvn clean install
mvn eclipse:clean eclipse:eclipse

If you are using a Linux or Mac OSX system, you can run the following commands:

export JAVA_HOME=[JDK_INSTALL_DIRECTORY]
export PATH=$PATH:$JAVA_HOME/bin

export MAVEN_HOME=[MAVEN_INSTALL_DIRECTORY]
export PATH=$PATH:$MAVEN_HOME/bin

export SPARK_HOME=[SPARK_INSTALL_DIRECTORY]
export PATH=$PATH:$SPARK_HOME/bin

cd /Users/USER_NAME/spark-mllib-sample-app

mvn clean install
mvn eclipse:clean eclipse:eclipse

 

If the application build is successful, the packaged JAR file will be created in the target directory.

We will use the spark-submit command to execute the Spark program. Here are the commands for running the program on Windows and on Linux/Mac, respectively.

Windows:

%SPARK_HOME%\bin\spark-submit --class "org.apache.spark.examples.mllib.JavaRecommendationExample" --master local[*] target\spark-mllib-sample-1.0.jar

Linux/Mac:

$SPARK_HOME/bin/spark-submit --class "org.apache.spark.examples.mllib.JavaRecommendationExample" --master local[*] target/spark-mllib-sample-1.0.jar

Monitoring the Spark MLlib Application

We can monitor the Spark program status on the web console, which is available at http://localhost:4040 by default.

Let’s look at some of the visualizations showing statistics from the Spark machine learning program.

We can view the details of all jobs in the sample machine learning program. Click on the “Jobs” tab on the web console screen to navigate to the Spark Jobs page that shows these job details.

Figure 5 below shows the status of the jobs from the sample program.


Figure 5. Spark jobs statistics for the machine learning program

The Directed Acyclic Graph (DAG) visualization shows the dependency graph of the different RDD operations we ran in the program. Figure 6 below shows a screenshot of this visualization for the Spark machine learning job.

Figure 6. DAG visualization of the Spark machine learning job

 
