Spark Machine Learning Application
A machine learning application that uses the collaborative filtering technique to predict which movies to recommend to a user, based on other users’ ratings of different movies.
Our recommendation engine solution will use the Alternating Least Squares (ALS) machine learning algorithm.
Even though the data sets used in the code example in this article are not very large or complex, Spark MLlib can be applied to real-world problems that are complex in nature and deal with data containing many dimensions and very complex predictor functions. Remember that machine learning can be used to solve problems that cannot be solved by numerical means alone.
To keep the machine learning application simple so we can focus on the Spark MLlib API, we’ll follow the movie recommendations example discussed in a Spark Summit workshop. This exercise is a very good resource for learning more about the Spark MLlib library. For more details on the application, visit their website.
The use case we want to implement with a Spark-based machine learning solution is a recommendation engine.
Recommendation engines are used to make predictions for unknown user-item associations (e.g. movie ratings) based on known user-item associations. They can make predictions based on a user’s affinity to other items and other users’ affinity to this specific item. The engine builds a prediction model based on the known data (called Training Data) and then makes predictions for unknown user-item associations (called Test Data).
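Under the hood, ALS learns a small latent-factor vector for every user and every movie, and a predicted rating is simply the dot product of the two vectors. A minimal plain-Java illustration of that prediction step (the factor values below are made up for the example; the real factors are learned by training):

```java
public class LatentFactorExample {

    // Predicted rating = dot product of the user's and the item's latent-factor vectors.
    static double predictRating(double[] userFactors, double[] itemFactors) {
        double score = 0.0;
        for (int i = 0; i < userFactors.length; i++) {
            score += userFactors[i] * itemFactors[i];
        }
        return score;
    }

    public static void main(String[] args) {
        double[] user = {1.0, 2.0};  // hypothetical learned factors for a user
        double[] movie = {3.0, 0.5}; // hypothetical learned factors for a movie
        System.out.println(predictRating(user, movie)); // 1.0*3.0 + 2.0*0.5 = 4.0
    }
}
```

ALS alternates between fixing the user factors and solving for the movie factors (and vice versa) until the predicted ratings fit the known ratings well.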
Our program includes the following steps to arrive at the top movie recommendations for the user.
- Load movies data file
- Load the data file with ratings provided by a specific user (you)
- Load the ratings data provided by other users (community)
- Join the user ratings data with community ratings into a single RDD
- Train the model using ALS algorithm using ratings data
- Identify the movies not rated by a particular user (userId = 1)
- Predict the ratings of the items not rated by user
- Get top N recommendations (N=5 in our example)
- Display the recommendation data on the console
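The steps above can be sketched end-to-end in plain Java with in-memory data. Note that this is only an illustration of the pipeline’s shape: a naive community-average predictor stands in for the ALS model (since running Spark is out of scope here), and all data values are hypothetical. In the actual program, training an ALS model on the joined ratings replaces the averaging step.

```java
import java.util.*;
import java.util.stream.*;

public class RecommendationPipelineSketch {
    // One user-movie rating, mirroring the userId,movieId,rating fields.
    record Rating(int userId, int movieId, double rating) {}

    static List<Integer> topNRecommendations(
            List<Rating> userRatings, List<Rating> communityRatings,
            int userId, int n) {
        // Join the user's ratings with the community ratings.
        List<Rating> all = new ArrayList<>(userRatings);
        all.addAll(communityRatings);

        // Identify the movies already rated by the user in context.
        Set<Integer> seen = userRatings.stream()
                .filter(r -> r.userId() == userId)
                .map(Rating::movieId)
                .collect(Collectors.toSet());

        // Predict a rating for each unseen movie.
        // Naive stand-in: community average (the real program trains ALS here).
        Map<Integer, Double> predicted = all.stream()
                .filter(r -> !seen.contains(r.movieId()))
                .collect(Collectors.groupingBy(Rating::movieId,
                        Collectors.averagingDouble(Rating::rating)));

        // Take the top N movies by predicted rating.
        return predicted.entrySet().stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Rating> mine = List.of(new Rating(0, 10, 5.0));
        List<Rating> community = List.of(
                new Rating(1, 20, 4.0), new Rating(2, 20, 5.0),
                new Rating(1, 30, 2.0), new Rating(2, 10, 3.0));
        // Movie 10 is already rated by user 0, so only 20 and 30 are candidates.
        System.out.println(topNRecommendations(mine, community, 0, 2)); // [20, 30]
    }
}
```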
We will use the movie datasets provided by the MovieLens group. There are a few different data files we need for the sample application. These datasets are available for download from the GroupLens website. We will use one of the latest datasets (the smaller version with 100K ratings). Download the dataset zip file from the website.
The following table shows the different datasets used in the application.

| # | Dataset | Description | File Format |
|---|---------|-------------|-------------|
| 2 | User Ratings Data | Ratings by a specific user. | userId,movieId,rating,timestamp |
| 3 | Community Ratings Data | Ratings by other users. | userId,movieId,rating,timestamp |
The user ratings file is for the user in context. You can update the ratings in this file based on your own movie preferences. We’ll assign a user id called “User Id 0” to represent these ratings.
When we run the recommendation engine program, we’ll join the specific user ratings data with ratings from the community (other users).
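Each line in both ratings files follows the userId,movieId,rating,timestamp format shown in the table above. A small sketch of parsing one such line into a rating object (the sample line values are made up for the example):

```java
public class RatingLineParser {
    // Mirrors the fields of one line in the ratings files.
    record Rating(int userId, int movieId, double rating, long timestamp) {}

    // Parse one "userId,movieId,rating,timestamp" line from a ratings file.
    static Rating parse(String line) {
        String[] f = line.split(",");
        return new Rating(Integer.parseInt(f[0]), Integer.parseInt(f[1]),
                Double.parseDouble(f[2]), Long.parseLong(f[3]));
    }

    public static void main(String[] args) {
        Rating r = parse("0,50,4.0,964982703"); // hypothetical sample line
        System.out.println(r.movieId() + " rated " + r.rating()); // prints "50 rated 4.0"
    }
}
```

Parsing both files this way yields two collections of ratings that the program then combines into a single dataset before training.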
We will use Java to write the Spark MLlib program, which can be executed as a stand-alone application. The program uses the following MLlib Java classes (all are located in the org.apache.spark.mllib.recommendation package):
We will use the following technologies in the sample application to illustrate how the Spark MLlib API is used for performing predictive analytics.
Architecture components of the sample application are shown in Figure 4 below.
Figure 4. Spark Machine Learning Sample Application Architecture
There are several implementations of the movie recommendation example available in different languages supported by Spark, like Scala (Databricks and MapR), Java (Spark Examples and a Java based Recommendation Engine), and Python. We’ll use the Java solution in our sample application. Download the Java program from the Spark examples website to run the example on your local machine. Create a new Maven-based Java project called spark-mllib-sample-app and copy the Java class into the project. Modify the Java class to pass in the data sets discussed in the previous section.
Make sure you include the required Spark Java libraries in the dependencies section of the Maven pom.xml file. To do a clean build and download the Spark library JAR files, you can run the following commands.
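A sketch of the dependency entries to add to pom.xml; the Scala-suffixed artifact ids and the version placeholder must match the Spark build you installed:

```xml
<dependencies>
  <!-- The _2.12 suffix must match the Scala version of your Spark build -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>[SPARK_VERSION]</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>[SPARK_VERSION]</version>
  </dependency>
</dependencies>
```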
Set environment parameters for JDK (JAVA_HOME), Maven (MAVEN_HOME), and Spark (SPARK_HOME).
For Windows Operating System:
```
set JAVA_HOME=[JDK_INSTALL_DIRECTORY]
set PATH=%PATH%;%JAVA_HOME%\bin
set MAVEN_HOME=[MAVEN_INSTALL_DIRECTORY]
set PATH=%PATH%;%MAVEN_HOME%\bin
set SPARK_HOME=[SPARK_INSTALL_DIRECTORY]
set PATH=%PATH%;%SPARK_HOME%\bin
cd c:\dev\projects\spark-mllib-sample-app
mvn clean install
mvn eclipse:clean eclipse:eclipse
```
If you are using a Linux or Mac OSX system, you can run the following commands:
```
export JAVA_HOME=[JDK_INSTALL_DIRECTORY]
export PATH=$PATH:$JAVA_HOME/bin
export MAVEN_HOME=[MAVEN_INSTALL_DIRECTORY]
export PATH=$PATH:$MAVEN_HOME/bin
export SPARK_HOME=[SPARK_INSTALL_DIRECTORY]
export PATH=$PATH:$SPARK_HOME/bin
cd /Users/USER_NAME/spark-mllib-sample-app
mvn clean install
mvn eclipse:clean eclipse:eclipse
```
If the application build is successful, the packaged JAR file (spark-mllib-sample-1.0.jar) will be created in the target directory. We will use the spark-submit command to execute the Spark program. Here are the commands for running the program on Windows and Linux/Mac respectively.
```
%SPARK_HOME%\bin\spark-submit --class "org.apache.spark.examples.mllib.JavaRecommendationExample" --master local[*] target\spark-mllib-sample-1.0.jar
```

```
$SPARK_HOME/bin/spark-submit --class "org.apache.spark.examples.mllib.JavaRecommendationExample" --master local[*] target/spark-mllib-sample-1.0.jar
```
Monitoring of Spark MLlib Application
We can monitor the Spark program status on the web console, which by default is available at http://localhost:4040 when running Spark locally.
Let’s look at some of these visualization tools showing the Spark machine learning program statistics.
We can view the details of all jobs in the sample machine learning program. Click on the “Jobs” tab on the web console screen to navigate to the Spark Jobs web page, which shows these job details.
Figure 5 below shows the status of the jobs from the sample program.
Figure 5. Spark jobs statistics for the machine learning program
A Directed Acyclic Graph (DAG) shows the dependency graph of the different RDD operations we ran in the program. Figure 6 below shows a screenshot of this visualization for the Spark machine learning job.