Abstract

With the emergence of Industry 4.0, online searches on how to obtain, process, and analyze data have surged, and StackOverflow is one of the most popular platforms for answering these questions. This research aims to uncover the underlying themes in the top posts about Python on StackOverflow over the last five years.

From the 1.3 million posts about Python extracted from the website, a random sample of 10,000 posts was used to determine the natural grouping of topics. The post titles were vectorized using Term Frequency–Inverse Document Frequency (TF-IDF), and Truncated Singular Value Decomposition (TSVD) was then applied for dimensionality reduction.
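
As an illustration of the vectorization step, the following is a minimal standard-library sketch of TF-IDF applied to post titles. The helper name `tfidf`, the whitespace tokenization, the smoothed IDF formula, and the sample titles are assumptions made for illustration and are not taken from the study. The TSVD step is not shown, since it would typically rely on a linear-algebra library.

```python
import math
from collections import Counter


def tfidf(titles):
    """Return one {term: weight} dict per title, L2-normalized.

    Hypothetical helper: tokenizes on whitespace and uses a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1, which is one common variant.
    """
    docs = [title.lower().split() for title in titles]
    n_docs = len(docs)
    # Document frequency: number of titles that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this title
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Scale to unit length; an empty title keeps an empty vector.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


# Made-up titles, for illustration only.
titles = [
    "how to read csv file in python",
    "how to merge two dictionaries in python",
    "python read json file",
]
vectors = tfidf(titles)
```

In this toy corpus, a term that appears in only one title (such as `csv`) receives a higher weight than a term that appears in every title (such as `python`), which is the property that makes TF-IDF useful for separating topics.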

Lastly, a two-level clustering procedure was implemented using the K-Means method: Level 1 on numeric features and Level 2 on thematic features. The top ‘How To’ topics in Python were found to fall into five categories: Hot Post, Trending, S.O.S., Curious Topics, and Spam. The results may help students, enthusiasts, and academics identify topics to focus on and develop literature and programs that address the demand behind these queries.
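
For readers unfamiliar with K-Means, the clustering step can be sketched as below using the standard iterative assign-and-update procedure (Lloyd's algorithm). The function name `kmeans`, its parameters, and the toy numeric features are hypothetical. The sketch covers a single clustering pass only; the features used at each level and the way the two levels are combined in the study are not reproduced here.

```python
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Cluster equal-length numeric tuples into k groups (hypothetical helper)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct sample points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Made-up numeric features per post (for example, score and answer count).
numeric_features = [(1, 1), (2, 1), (1, 2), (100, 90), (110, 95), (105, 100)]
labels, centroids = kmeans(numeric_features, k=2)
```

The same function could, under these assumptions, be applied to the reduced thematic vectors. The number of clusters `k` has to be chosen by the analyst, and results can depend on the random initialization.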