Abstract

A movie recommender system may be the solution to binge-watching and “what to watch next.” A basic movie recommender makes generic suggestions based on popularity and/or genre. This strategy doesn’t deliver individualized suggestions based on profile and history. 

This project seeks to show how Netflix’s computer software works. This model uses movie synopses and genre to propose movies to users. The researchers think a cosine similarity-based movie recommender system might help local media organizations and aspiring streaming service providers. It may also assist customers by boosting decision-making and quality, removing “what to watch next?” 

The recommender system was created using an IMDb scraped dataset, with the features 1) about and 2) genre as the main focus. The raw data were cleaned and preprocessed before converting the relevant features into a Term Frequency and Inverse Document Frequency (TF-IDF) matrix. This sparse matrix representation of token counts served as a basis for the two key techniques implemented in this project: 1) model creation using cosine similarity and 2) cluster analysis using k-means. The latter required dimensionality reduction, therefore Latent Semantic Analysis was performed to the TF-IDF matrix. The output clusters of similar movies were primarily used as pseudo-ground truth labels in evaluating the performance of the movie recommender system. 

The following are the key findings from calculating the model performance: 

  1. The most effective feature combination was the movie synopses paired with genres, which generated the highest precision score. 
  1. The seven output clusters produced by the k-Means clustering algorithm that served as pseudo-ground truth labels were successful in evaluating the model’s performance when combined with the results of the model. 
  1. The recommender system outperforms random prediction by a factor of two. The former had an average precision score of 37.16%, whereas the latter had only 17.20%. 

This movie recommender system is simple, but its performance can be further enhanced. Improvements can be achieved by: 1) acquiring and utilizing a larger dataset, as the model only used roughly 1,400 data points; 2) adding user demographic data as features, and; 3) including TV series data as episodic programming is also thought to be the key driver of binge-watching. 

It is recommended to deep dive into the negative aspects of binge-watching to unravel the various issues surrounding it, as well as to highlight the distinction between compulsive and recreational binge-watching.