Abstract

The Internet Movie Database (IMDb) is an online database of information about films and TV programs. This study aims to identify the themes of movies over time using unsupervised clustering. Movie data were scraped from IMDb and stored in a SQLite3 database. The analysis focused only on American movies produced from 1980 to 2019, filtered by rating and grouped into four decades: the 1980s, 1990s, 2000s, and 2010s.
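
As an illustration of the filtering and decade grouping described above, the following is a minimal sketch using Python's built-in `sqlite3` module. The table name `movies`, its columns, and the sample rows are hypothetical placeholders; the study's actual schema is not stated here.

```python
import sqlite3

# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies ("
    "title TEXT, country TEXT, year INTEGER, rating REAL, description TEXT)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?, ?)",
    [
        ("Movie A", "USA", 1985, 7.1, "placeholder description"),
        ("Movie B", "USA", 1999, 5.4, "placeholder description"),
        ("Movie C", "UK", 2005, 6.8, "placeholder description"),
        ("Movie D", "USA", 2012, 6.0, "placeholder description"),
        ("Movie E", "USA", 1975, 8.0, "placeholder description"),
    ],
)

# Keep American movies from 1980 to 2019 and bucket them by decade.
# Integer division on the INTEGER year column maps 1985 to 1980, and so on.
rows = conn.execute(
    """
    SELECT title, (year / 10) * 10 AS decade
    FROM movies
    WHERE country = 'USA' AND year BETWEEN 1980 AND 2019
    ORDER BY year
    """
).fetchall()

for title, decade in rows:
    print(title, decade)
```

Computing the decade inside the query keeps the grouping next to the filter, so each decade's subset can be pulled with a single `WHERE` clause.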

Movies rated 6.0 and above were classified as ‘above average’ and those rated below 6.0 as ‘below average’. The description of each movie was vectorized using term frequency-inverse document frequency (TF-IDF) weighting, and the dimensionality was reduced with Latent Semantic Analysis (LSA) to the n components that explain at least 80% of the variance. Agglomerative hierarchical clustering with Ward’s method was then used to generate clusters for each decade.
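
The pipeline described above (TF-IDF, then LSA, then Ward clustering) can be sketched as follows. This is only an illustrative sketch, assuming scikit-learn and NumPy; the abstract does not name the libraries used. The toy descriptions, the English stop-word list, and the choice of two clusters are placeholders rather than the study's settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions standing in for one decade's movie descriptions.
descriptions = [
    "a detective hunts a killer in new york city",
    "a young lawyer starts a new life in new york",
    "a family road trip across the country goes wrong",
    "a soldier returns home after the war",
    "two friends open a restaurant in new york city",
    "a killer stalks a small town during the winter",
]

# 1. TF-IDF weighting of the descriptions.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# 2. LSA: fit a truncated SVD with as many components as the data allow,
#    then keep the smallest n whose cumulative explained variance is >= 80%.
max_components = min(tfidf.shape) - 1
svd = TruncatedSVD(n_components=max_components, random_state=0)
projected = svd.fit_transform(tfidf)
cumulative = np.cumsum(svd.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.80)) + 1
n_components = min(n_components, max_components)  # cap if 80% is never reached
reduced = projected[:, :n_components]

# 3. Agglomerative clustering with Ward linkage (n_clusters=2 is arbitrary here).
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(reduced)

print(n_components, labels)
```

Slicing the first n columns of the projection is a shortcut that avoids refitting the SVD once n is known. On a real corpus, the number of clusters would typically be chosen by inspecting the dendrogram or a cluster-validity measure.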

Results show that movies set in New York City received above-average ratings throughout the four decades. Movies based on true stories saw a significant increase in ratings in the past two decades (2000-2019), while violent movies (involving deaths and killings) received lower ratings in three decades: the 1980s, 2000s, and 2010s.