# Recommender: Movie recommendations
This experiment demonstrates the use of the Matchbox recommender modules to train a movie recommender engine. We use a pure collaborative filtering approach: the model learns from a collection of users who have all rated a subset of a catalog of movies. Matrix factorization allows us to infer from this latent user preferences and movie traits. These preferences and traits are then used to predict what rating a user will give to unseen movies, so that we can recommend movies that the user is most likely to enjoy.
In this experiment, we will both train the engine and *score* new data, to demonstrate the different modes in which a recommender can be used and evaluated.
# Data
The training data is approximately 225,000 ratings for 15,742 movies by 26,770 users, extracted from Twitter using techniques described in the original paper by Dooms, De Pessemier and Martens [1]. The paper and data can be found on [GitHub](https://github.com/sidooms/MovieTweetings).
Each instance of data is a tuple consisting of a user identifier, a movie identifier, and the rating. The dataset also contains a time-stamp, but we did not use it in this analysis.
![image1][image1]
To this data, we added a file containing movie names extracted from IMDB, joined on the movie identifier from the ratings data.
![image2][image2]
# Model
First, we need to prepare the data for use with the **Train Recommender** module. It requires triplets in this format: `<user, item, rating>`.
![image5][image5]
1. Both the ratings and movie titles have already been uploaded and are available as datasets in Azure ML Studio, so we simply connect them in our experiment.
2. The rating field looks like an integer, but is of type `numeric`. Since the trainer requires an integer rating, we use **Metadata Editor** to cast it to integer.
![image3][image3]
3. The **Train Recommender** module is more lenient with respect to the user and item identifiers. One would typically use integer IDs for these too (and use those as keys for names, images and other metadata in the presentation layer), but to make our results easier to work with, we merge the ratings and movie title datasets using the **Join** module. Note that we use an inner (1:1) join and need to specify a key column that is common to both the left and right datasets.
![image4][image4]
4. The **Train Recommender** module requires that the input contain three fields used for training, so we use **Project Columns** to select only the **`user ID`**, **`movie name`**, and **`rating`** fields.
![image6][image6]
5. This dataset contains a few conflicting ratings for the same user-movie pairs. These introduce noise into training and evaluation, so we remove the duplicates, arbitrarily retaining only the first occurrence of each user-movie pair we encounter. (A pandas sketch of these preparation steps follows below.)
![image7][image7]
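For readers who prefer to follow along outside Studio, here is a rough pandas sketch of the same preparation pipeline. The file names and column names (`user ID`, `movie ID`, `movie name`, `rating`) are assumptions chosen for illustration; in the experiment itself, the work is done by the Studio modules described above.

```python
import pandas as pd

# Assumed file and column names, for illustration only.
ratings = pd.read_csv("ratings.csv")      # columns: user ID, movie ID, rating, timestamp
movies = pd.read_csv("movie_names.csv")   # columns: movie ID, movie name

# Step 2: cast the numeric rating column to integer.
ratings["rating"] = ratings["rating"].astype(int)

# Step 3: inner join on the shared movie identifier.
joined = ratings.merge(movies, on="movie ID", how="inner")

# Step 4: keep only the three fields the trainer needs.
triples = joined[["user ID", "movie name", "rating"]]

# Step 5: drop conflicting duplicates, keeping the first user-movie pair encountered.
triples = triples.drop_duplicates(subset=["user ID", "movie name"], keep="first")
```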
As with any statistical model, we want to fit parameters on one set of data and test accuracy on a hold-out set. In a collaborative filtering approach we need to make sure that we can learn something about each user and each item, so we cannot simply take a random sample of all the observations. Fortunately, Azure ML Studio provides a special **Recommender split** option in the **Split** module that lets you control how the train and test samples are selected.
![image8][image8]
For this experiment, we used these settings:
- Fraction of training-only users: 0.75. This means that for 75% of the users we will use all the ratings to train. For the other 25% we will hold out some ratings for testing.
- Fraction of test-user ratings for testing: 0.25. For each user in the test group, we will hold out 25% of that user's ratings for testing the model.
- Fraction of cold users: 0. Cold users are users for whom we have no prior training data. In general the Matchbox algorithm can use optional user metadata to make recommendations for users even before we've seen a single rating. However, for this problem we do not have user metadata, so we will not evaluate on cold users.
- Fraction of cold items: 0. We will treat cold items the same as cold users, and evaluate only on movies for which we've received ratings.
- Fraction of ignored users: 0. In some cases we might want to test an algorithm or settings on a subset of the data. Here we'll train on the full set.
- Fraction of ignored items: 0. Same as for users. (A rough sketch of this splitting scheme appears after this list.)
We are now ready to train the model. The **Train Recommender** module requires training tuples as described above, but accepts two optional inputs.
- Item metadata. If available, you could provide for each movie additional features such as the genre, director, lead actor, box office, awards, etc. This information would be input in a dataset similar to the movie names set, using a key column to join to the training dataset, and fields for all other attributes.
- User metadata. If available, you could provide a similar dataset containing features that describe the users.
In general, this additional metadata is of more value for users than for items. This makes sense when you consider that each movie typically has a large number of ratings, so item metadata adds relatively little; some users, on the other hand, might have provided few or no ratings, yet their preferences can still be partly inferred from demographics such as gender and age.
The **Train Recommender** module requires two parameters:
- Number of features: This determines the number of latent parameters that will be learned for each user and each item. (Technically, the recommender algorithm is based on factorizing the user-movie interaction matrix; this parameter determines the rank of the approximation.) More features make for a more powerful model, but risk overfitting the training data. The parameter is usually determined through experimentation, with the goal of finding the smallest number that achieves acceptable performance. For this problem, the default value of 20 features works well, but you are encouraged to try other values.
- Number of iterations: Model parameters are found by random initialization, followed by minimizing the residual error (the difference between the true and predicted ratings for each user-movie pair) using an iterative gradient descent technique. The error typically decreases exponentially, meaning that most of the benefit occurs in the initial iterations. Thus, it is common practice not to run the optimization all the way to convergence, but instead to limit the iterations to a reasonable number in order to keep training time down. Again, this value is best determined through experimentation, but values greater than 20 are usually sufficient. We used the default of 30. (A toy sketch of this factorization idea appears below.)
# Results
In this sample experiment, we demonstrate three different ways that you can use the trained recommender model:
1. To predict ratings
2. To predict top-*n* movies from a list already rated by each user
3. To make *n* recommendations from the full catalog for each user
The first two methods are used simply to evaluate the performance of the learned model, while the last method represents a typical production use case.
![image9][image9]
To perform all three types of prediction, use the **Score Recommender** module. The module has two required inputs and two optional inputs.
- The first required input is a trained model. In this case we have directly connected the output of the trainer, but for production one would save the trained model and then connect this saved model to the scorer.
- The second input is a dataset to be scored. The format of this dataset will depend on the task, as we will describe below.
- The two optional ports are for user and item metadata, similar to the optional inputs when training. If you used these inputs when training, you should also provide the same data when scoring.
### Predicting Ratings
Prediction is a straightforward task. You provide an input dataset for which you want to get scores, using the three-item tuple format used for training. The **Score Recommender** module will use the trained model to predict a rating for each user-movie pair, and outputs a tuple consisting of `<user, item, predicted rating>`.
![image10][image10]
To evaluate the accuracy of predictions, we use the **Evaluate Recommender** module. The first input is the testing dataset, containing tuples (movie-user-rating) similar to those provided for training. Typically you will get this data from the *test* output port of the **Split** module you used when setting up the experiment.
Note that the **Evaluate Recommender** module requires two parameters:
- Minimum number of items
- Minimum number of users
These parameters let you limit the evaluation to users who have rated at least *n* items and to items that have been rated by at least *m* users, respectively.
![image11][image11]
In this experiment, the second input contains the same set of tuples that we used earlier to train the model; therefore, evaluation will compare the predicted ratings with the actual ratings, using these two metrics:
- Mean Absolute Error (MAE) is the average of the magnitude of the difference between the true and predicted ratings. This is a good measure of the perceived accuracy of the system.
- Root Mean Square Error (RMSE) is the square root of the average of the squared differences between true and predicted ratings. This measures how well the model approximates the true expected value of the ratings and penalizes large errors more heavily. (Both metrics are computed as in the sketch below.)
![image12][image12]
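For reference, both metrics are simple functions of the paired true and predicted ratings; a small sketch (with made-up example numbers) follows.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the rating errors."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large errors more heavily than MAE."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Illustrative numbers only, not taken from the experiment.
print(mae([8, 6, 9], [7.2, 6.5, 9.1]))    # ~0.47
print(rmse([8, 6, 9], [7.2, 6.5, 9.1]))   # ~0.55
```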
The real value of these metrics is in comparing different parameter settings for the trainer. For our run, we obtained MAE = 1.77 and RMSE = 2.46, which are reasonable considering the 1-10 rating scale.
### Recommend Movies from Test
In this part of the experiment, we use the model to create a rank-ordered list of the top *n* movies for each user, selecting *only* from movies that the user has already rated. The input is the same movie-user-rating format that was used for training. The **Score Recommender** module uses the tuples to extract the set of users and then, for each user, the set of rated movies from which to build the rank-ordered list.
Note that this time we need to specify two parameters:
- Maximum number of items to recommend: The engine will score all the movies in each user's set and then output up to this number, rank-ordered high-to-low by estimated rating. We chose 5, but one would typically set this to reflect the actual user experience: how many items are we going to present to the user?
- Minimum size of recommended set: Since we're interested in the rank ordering, it doesn't make sense to include users with fewer than 2 rated items. One can set the bar higher to evaluate how the model does for users with many rated items. (A rough sketch of this ranking logic appears below.)
![image13][image13]
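The module does all of this internally; as a rough sketch of the same logic, assuming a hypothetical `predict(user, movie)` scoring function (for example `U[u] @ V[i]` from the earlier toy factorization) and the column names used above:

```python
def top_n_from_rated(test, predict, n=5, min_set_size=2):
    """For each user with at least `min_set_size` rated movies, rank only the
    movies that user has rated and return the top n by predicted rating."""
    results = {}
    for user, group in test.groupby("user ID"):
        if len(group) < min_set_size:
            continue
        ranked = sorted(group["movie name"], key=lambda m: predict(user, m), reverse=True)
        results[user] = ranked[:n]
    return results
```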
The output is a dataset containing a row for each unique user, with a column for each of the *n* requested recommendations.
![image14][image14]
For our final evaluation, we again use the **Evaluate Recommender** module.
The first input is our test split of the data, but this time the scored dataset is recognized as a per-user recommendation, so the module will calculate Normalized Discounted Cumulative Gain (NDCG) instead.
NDCG is a metric commonly used to evaluate search and recommender results. It produces a number between 0 and 1, where 1 means the ranking is perfect and 0 means that none of the returned items were in the user's actual top *n*. Intermediate values indicate that some of the user's top-rated items appear in the list, with more credit given when those items are near the top. See [this Wikipedia article](http://en.wikipedia.org/wiki/Discounted_cumulative_gain) for more detail.
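A minimal sketch of the computation for a single user follows, using the standard logarithmic discount from the linked article; the module's exact gain and discount conventions may differ.

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain: each item's relevance (here, its true rating)
    is discounted by log2(rank + 1)."""
    relevances = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, len(relevances) + 2))
    return np.sum(relevances / discounts)

def ndcg(ranked_relevances):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# True ratings of five recommended movies, in the order they were recommended.
print(ndcg([9, 7, 10, 6, 8]))   # ~0.96: the best movies sit near the top of the list
```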
![image15][image15]
We obtained a value of 0.95, which is pretty good.
### Recommendations from Catalog
A typical use case for a recommender is to request, from the catalog of all items, the top *n* items most likely to interest a user. In this mode, the input to the scorer should contain only a single column of the user IDs for which to generate recommendations.
To demonstrate this approach, we generated a list of 100 user IDs by taking the test data, extracting the unique user IDs, and then using the *Head* option in the **Partition and Sample** module to select the first 100.
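Outside Studio, the same sample could be produced in one line of pandas (again assuming the `user ID` column name):

```python
# Unique user IDs from the test split, first 100 only.
user_ids = test[["user ID"]].drop_duplicates().head(100)
```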
![image16][image16]
The output shows the three recommendations for each of the 100 user IDs provided. Looks like *The Shawshank Redemption* and *Dark Knight* are popular choices!
![image17][image17]
# Web Service
A key feature of Azure Machine Learning is the ability to easily publish models as web services on Windows Azure. To publish the recommender, the first step is to save the trained model. You can do this by clicking the output port of **Train Recommender** and selecting *Save as Trained Model*.
![image19][image19]
We then create a new experiment that has only the scoring module, and add the saved model.
We also need to provide sample input data, so in this case we re-use the data pipeline that we built for sampling 100 user IDs. To specify the Web service entry and exit points, use the special **Web Service** modules. Note that the **Web service input** module is attached to the node where input data would enter the experiment.
![image18][image18]
After the experiment runs successfully, it can be published by clicking **Publish Web Service** at the bottom of the experiment canvas.
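Once published, the service can be called from any client. The sketch below shows what a request-response call might look like in Python; the URL, API key, input name, and JSON schema here are placeholders, and the authoritative request format and sample code are shown on the service's API help page.

```python
import json
import urllib.request

# Placeholder values: copy the real URL and API key from the service's API help page.
url = "https://REGION.services.azureml.net/.../execute?api-version=2.0&details=true"
api_key = "YOUR_API_KEY"

# One user ID in, top recommendations out (input name and column name assumed).
body = {"Inputs": {"input1": {"ColumnNames": ["user ID"],
                              "Values": [["12345"]]}},
        "GlobalParameters": {}}

request = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer " + api_key})

print(urllib.request.urlopen(request).read().decode("utf-8"))
```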
# References
[1] Simon Dooms, Toon De Pessemier, and Luc Martens. MovieTweetings: A Movie Rating Dataset Collected from Twitter. *Workshop on Crowdsourcing and Human Computation for Recommender Systems (CrowdRec), RecSys 2013*.
<!-- Images -->
[image1]:http://az712634.vo.msecnd.net/samplesimg/v1/14/data-ratings.png
[image2]:http://az712634.vo.msecnd.net/samplesimg/v1/14/data-movies.png
[image3]:http://az712634.vo.msecnd.net/samplesimg/v1/14/mde.png
[image4]:http://az712634.vo.msecnd.net/samplesimg/v1/14/join.png
[image5]:http://az712634.vo.msecnd.net/samplesimg/v1/14/model-1.png
[image6]:http://az712634.vo.msecnd.net/samplesimg/v1/14/project.png
[image7]:http://az712634.vo.msecnd.net/samplesimg/v1/14/dedup.png
[image8]:http://az712634.vo.msecnd.net/samplesimg/v1/14/split.png
[image9]:http://az712634.vo.msecnd.net/samplesimg/v1/14/train-score.png
[image10]:http://az712634.vo.msecnd.net/samplesimg/v1/14/score-1.png
[image11]:http://az712634.vo.msecnd.net/samplesimg/v1/14/eval.png
[image12]:http://az712634.vo.msecnd.net/samplesimg/v1/14/result-1.png
[image13]:http://az712634.vo.msecnd.net/samplesimg/v1/14/score-2.png
[image14]:http://az712634.vo.msecnd.net/samplesimg/v1/14/out-1.png
[image15]:http://az712634.vo.msecnd.net/samplesimg/v1/14/result-2.png
[image16]:http://az712634.vo.msecnd.net/samplesimg/v1/14/model-2.png
[image17]:http://az712634.vo.msecnd.net/samplesimg/v1/14/out-2.png
[image18]:http://az712634.vo.msecnd.net/samplesimg/v1/14/prod.png
[image19]:http://az712634.vo.msecnd.net/samplesimg/v1/14/save.png