Abstract

People are mostly aware of issues unique to their own circumstances, and arguments sometimes arise between the sexes from a lack of awareness of the issues the opposite sex faces. To glean insights on these issues, this study mined data through Reddit’s Application Programming Interface (API) from two high-traffic, high-engagement threads in the r/AskReddit subreddit, titled “What are some men’s issues that are often overlooked?” and “What are women’s issues that are often overlooked?” The data were then cleaned and vectorized using tokenization and a TF-IDF representation. An initial exploratory data analysis visualized the most frequent words in both threads after cleaning but before dimensionality reduction and clustering. Dimensionality reduction using Latent Semantic Analysis, which applies Singular Value Decomposition, trimmed the features to a more manageable size prior to clustering. Using this reduced representation, whose components capture much of the variance in the data, different clustering methods were tested to find the best clustering. K-Medians with k=5 produced the best clustering, yielding five clusters each for men’s and women’s issues. Men’s issues revolved around poor mental health, being unheard, double standards, and forced circumcision. Women’s issues, on the other hand, revolved around medical concerns, reproductive health, and sexual objectification. An issue common to both sexes is growing concern about emotional and mental health.
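
The pipeline summarized above (TF-IDF vectorization, Latent Semantic Analysis, then K-Medians) could be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the study's actual code: the abstract does not name the libraries used, so scikit-learn is assumed for TF-IDF and truncated SVD, and K-Medians is hand-rolled in NumPy because scikit-learn does not ship one. The function names `k_medians` and `cluster_comments`, the stop-word setting, and the `n_components` default are hypothetical; only k=5 comes from the text.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def k_medians(X, k, n_iter=100, seed=0):
    """Hypothetical helper: K-Medians with Manhattan (L1) distance."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()

    def assign(c):
        # L1 distance from every row to every center, then nearest center.
        return np.abs(X[:, None, :] - c[None, :, :]).sum(axis=2).argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = np.median(members, axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(centers), centers


def cluster_comments(comments, n_components=100, k=5, seed=0):
    """Hypothetical pipeline: TF-IDF -> LSA (truncated SVD) -> K-Medians."""
    # Tokenize and weight terms; the stop-word list is an assumption.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(comments)
    # LSA: truncated SVD applied to the TF-IDF matrix.
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    reduced = svd.fit_transform(tfidf)
    labels, _ = k_medians(reduced, k, seed=seed)
    return labels
```

Under these assumptions, each thread's comments would be passed separately to `cluster_comments` with k=5, giving one cluster label per comment. `n_components` must be smaller than the TF-IDF vocabulary size.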