Recommender systems significantly improve our online experience. Streaming services, social media and online shopping are shaped by the mechanisms of different recommender systems.
By analyzing the content as well as user preferences, behavior and interactions with content, these systems provide personalized suggestions, helping users discover items, products, or information that aligns with their interests.
They can efficiently navigate large amounts of data, ultimately enhancing user engagement and satisfaction as well as increasing sales and user retention.
Recommender Systems can be split into three categories:
For this project, we compare different systems. Additional details on these systems can be found in these Jupyter Notebooks.
The first system we look at is a Content-Based Filtering System. For this purpose, we use this dataset (include link) that contains data
on the top 1000 movies based on IMDB ratings. We use the description of the movie as well as the name of the director, the genre and the most prominent actors
in the movie as a basis for the system. Building the recommender system relies on a few steps:
- Natural Language Processing: We combine the text data from the different features (overview, director, etc.) into a text. Note that you
can apply different weights based on the importance that the features have for you in the system. The resulting texts undergoes some natural language
processing (lemmatization and TFIDF-vectorization) as it is translated into numerical features.
- Similarity measures: The resulting numerical features of the input data is compared to the similarly processed data of the remaining movies.
The comparison can be based on different measures. For this project, we used so called cosine similarity. There is an explanation of this metric below. Other
possible metrics are the Euclidean distance or the Manhattan distance.
- Selection: Based on the similarity measure, the five movies that are most closely related to the input movies are selected.
We note here that this system has a clear weakness: If you like a movie from a franchise, recommendations are very likely to come from the same franchise.
Additionally, I want to note here that the dataset is fairly small, but that issue can be easily fixed by using a larger dataset. That is a consideration for future work in this direction.
Here, you can test out the content based recommender system by yourself. You can plug in three movies you like and the system will recommend five similar movies.
While content-based system relie (as the name suggests) on the content of the movies, collaborative systems relie on
human interactions with the movies. Collaborative-filtering systems can use user informations to compare the user to other
users with similarity metrics. Recommendations then are made based on what people who are similar to me like. The system That
was developed for this system, however, is item based. Here, items are compared based on user feedback (e.g. reviews) with
similarity metrics. Then, the items that are most closely related to the items that I like are recommended.
Let's discuss in more detail what such a system entails. To build our system, we used a dataset of critic reviews on Rotten
Tomatoes (include link to data set). We proceeded in the following steps:
- Reviewer-Movie-Matrix: The columns are the reviewers and the rows the movies. Each entry tells us what score the reviewer
gave the movie (if any). If the reviewer did not write a review of a specific movie, we set the value to 0. (Most entries are zero, we say the
matrix is sparse.)
- Movie-Similarity-Matrix: The Reviewer-Movie-Matrix gives us a vector of scores for each movie. We compare these vectors pairwise
(here we use cosine similarity, but other metrics are possible) and create a new matrix that consists of these pairwise comparisons.
The entries of this matrix tell us how similar each movie is to the other movies.
- Recommendations: When the user provides their favorite movies, the system checks the Movie-Similarity-Matrix and finds the movies
that are closest to them by adding the cosine similarity between that movie and each of the favorite movies.
Here you can try out the recommender system for yourself. Note that this system works better than the content-based system. Part of the reason might be that
the dataset is significantly larger than the conten-based dataset that was used (1000 movies compared to 10000).