Movie Recommendation Model Using Machine Learning
Overview
Using movie ratings data given on a scale of 1-5 by individual users, I created a machine learning algorithm that can predict how users will rate movies they have not yet rated. In the end, this algorithm could be used to recommend movies that it predicts a user would rate highly based on their previous ratings.
All of the code for this project is available on my GitHub here: https://github.com/bkphillips/Movie_rating_prediction
The data I am using is the MovieLens 10M dataset, which contains 10 million user ratings and is available at the link below:
https://grouplens.org/datasets/movielens/10m/
The information given in this dataset includes:
User ID
Movie Title
Movie ID
Time Stamp
Genre
Rating
The training dataset contains around 9 million observations and the validation set contains 1 million observations.
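To make the later steps concrete, here is a minimal Python sketch of loading and splitting the data. The file names and the "::" separator follow the MovieLens 10M distribution, but the column names, the pandas toolchain, and the simple 90/10 split are illustrative assumptions rather than the project's actual code.

```python
import pandas as pd

# Load the MovieLens 10M files (ratings.dat and movies.dat use "::" separators).
ratings = pd.read_csv("ml-10M100K/ratings.dat", sep="::", engine="python",
                      names=["userId", "movieId", "rating", "timestamp"])
movies = pd.read_csv("ml-10M100K/movies.dat", sep="::", engine="python",
                     names=["movieId", "title", "genres"])
data = ratings.merge(movies, on="movieId")

# Illustrative 90/10 split into ~9M training rows and ~1M validation rows.
validation = data.sample(frac=0.1, random_state=1)
train = data.drop(validation.index)
```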
The main machine learning method I am using is matrix factorization incorporating movie, user, and genre effects. I test the validity of each recommendation model by computing the Root Mean Squared Error (RMSE) of its predictions on the validation dataset. My goal is an RMSE of less than 0.8649 on the validation dataset.
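For reference, RMSE is the square root of the average squared difference between predicted and actual ratings. A minimal sketch of the metric in Python (the project repository may implement it differently):

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error between actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))
```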
Analysis
First, I wanted to look at the effects of both individual users and individual movies on ratings. Variability in the quality of movies will likely show up in the individual movie effect, while some users may be harsh or generous in their ratings, which will show up in the user effect. Below are the distributions of the movie effects (b_i) and user effects (b_u):
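As an illustration of how these effects can be estimated, the sketch below computes b_i and b_u as average deviations from the overall mean rating, continuing the hypothetical pandas variables from the loading sketch above; the project's actual estimation code may differ.

```python
# Overall mean rating across the training set.
mu = train["rating"].mean()

# Movie effect b_i: each movie's average deviation from the global mean.
b_i = (train.assign(dev=lambda d: d["rating"] - mu)
            .groupby("movieId", as_index=False)["dev"].mean()
            .rename(columns={"dev": "b_i"}))

# User effect b_u: each user's average residual after removing mu and b_i.
with_bi = train.merge(b_i, on="movieId")
b_u = (with_bi.assign(dev=lambda d: d["rating"] - mu - d["b_i"])
              .groupby("userId", as_index=False)["dev"].mean()
              .rename(columns={"dev": "b_u"}))
```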
I also wanted to incorporate genre into the model in order to produce a more accurate recommendation. To do this, I converted the genre information into a tidy format. Before the conversion, there were 797 distinct genre categories, shown in the average rating distribution below.
I then converted the data into a long format, leaving only 20 distinct genres that can more accurately predict the rating, as shown in the plot below of the average rating by the new tidy genres.
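A minimal sketch of this long-format conversion, again assuming the hypothetical pandas variables from the earlier sketches: each pipe-separated genres string (e.g. "Action|Adventure|Sci-Fi") is split so that every rating-genre pair gets its own row.

```python
# Split the pipe-separated genres string into one row per rating-genre pair.
train_genres = (train.assign(genre=lambda d: d["genres"].str.split("|"))
                     .explode("genre"))

# Average rating per individual genre (roughly 20 distinct genres remain).
avg_rating_by_genre = train_genres.groupby("genre")["rating"].mean().sort_values()
```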
Model and Results
The matrix factorization model that ended up providing the most accurate predictions was the one that incorporated movie, user, and user-genre effects.
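Below is a minimal sketch of how such a model could be assembled and scored, continuing the hypothetical variables from the earlier sketches (mu, b_i, b_u, train_genres, validation, and rmse). The user-genre effect b_ug is estimated here as each user's average residual within a genre, which is one plausible reading of the approach rather than the project's exact code.

```python
# Attach the movie and user effects to the long-format training data.
tg = train_genres.merge(b_i, on="movieId").merge(b_u, on="userId")

# User-genre effect b_ug: each user's average residual within a genre.
b_ug = (tg.assign(dev=lambda d: d["rating"] - mu - d["b_i"] - d["b_u"])
          .groupby(["userId", "genre"], as_index=False)["dev"].mean()
          .rename(columns={"dev": "b_ug"}))

# Expand the validation set the same way, attach all effects, and predict.
val_g = (validation.reset_index(drop=True)
                   .rename_axis("row_id").reset_index()
                   .assign(genre=lambda d: d["genres"].str.split("|"))
                   .explode("genre")
                   .merge(b_i, on="movieId", how="left")
                   .merge(b_u, on="userId", how="left")
                   .merge(b_ug, on=["userId", "genre"], how="left")
                   .fillna({"b_i": 0.0, "b_u": 0.0, "b_ug": 0.0}))
val_g["pred"] = mu + val_g["b_i"] + val_g["b_u"] + val_g["b_ug"]

# Average the per-genre predictions back to one prediction per original rating.
preds = val_g.groupby("row_id")["pred"].mean().sort_index()
print("Validation RMSE:", rmse(validation["rating"].to_numpy(), preds.to_numpy()))
```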
Conclusion
It appears that the fifth model, with movie, user, and user-genre effects (b_i + b_u + b_ug), had the strongest predictive performance on the validation dataset, with an RMSE of 0.8497552. It is interesting that the addition of genre effects did not produce a large improvement in predictive performance. Nevertheless, the results of this analysis show that individual users' tastes for a particular genre have a significant impact on how they rate a particular movie. In the real world, this makes sense, given that many people have a strong affinity for a particular genre of movies.
A major limitation of this method is that it requires previous rating information from a user, so it would have little predictive power for users who have not yet provided any ratings (the cold-start problem).