Executive Summary
The COVID-19 pandemic has forced people to isolate themselves at home to help reduce the spread of the coronavirus. This has encouraged people to move most of their activities online, including communicating with others and obtaining information. This motivated us to explore the underlying themes related to COVID-19 that emerge from Reddit submissions and comments in pandemic-related subreddits.
To execute our pipeline, we extracted monthly Reddit submissions and daily comments from the pushshift.io data dumps covering January to April 2020. We then selected the COVID-related submissions and comments from the extracted data and preprocessed them in preparation for model training. Next, we trained the topic extraction model on the preprocessed data, which yielded 35 topics for Reddit submissions and 50 topics for comments. Finally, we analyzed the resulting themes by inspecting the word weights of each topic and the evolution of theme counts over time. We observed that topics are dynamic, with a single topic dominating the platform at any given time. We also observed that the frequency of submissions increases in response to major pandemic-related events, such as increases in confirmed cases in specific countries and announcements of lockdowns.
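As a rough illustration of the filtering and time-evolution steps described above, the following is a minimal sketch and not the pipeline actually used in this work. The keyword list, the `text` field name, and the helper names `is_covid_related` and `daily_counts` are assumptions made for the example; only `created_utc` (a Unix timestamp) mirrors a field commonly found in Reddit dumps.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Hypothetical keyword list: the actual filter terms are not stated in this summary.
COVID_PATTERN = re.compile(r"\b(covid|coronavirus|lockdown|quarantine)", re.IGNORECASE)


def is_covid_related(text):
    """Return True if the text contains any of the assumed COVID keywords."""
    return bool(COVID_PATTERN.search(text))


def daily_counts(records):
    """Count COVID-related records per UTC day.

    Each record is a dict with an assumed 'text' field and a
    'created_utc' field holding Unix seconds.
    """
    counts = Counter()
    for rec in records:
        if is_covid_related(rec["text"]):
            day = datetime.fromtimestamp(rec["created_utc"], tz=timezone.utc).date()
            counts[day.isoformat()] += 1
    return counts


# Toy records for illustration only; these are not taken from the dataset.
sample = [
    {"text": "Italy announces nationwide lockdown", "created_utc": 1583798400},
    {"text": "Best pizza recipe?", "created_utc": 1583798400},
    {"text": "New COVID-19 cases reported", "created_utc": 1583884800},
]

print(dict(daily_counts(sample)))
```

A keyword match is only one possible way to select COVID-related posts. The same daily counting could be applied per topic once each record has been assigned a topic by the trained model.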
The results of our work may be used to augment moderator bots such as those employed by Reddit, and to build machine learning models that predict the number of cases from trends in themes. Because many of the themes we obtained are related to one another, the model can also be tuned further to produce fewer clusters.