Abstract:
The overarching goal of this research was to understand what the data science Reddit online community discussed before, during, and after COVID-19. We used a publicly available Reddit API to harvest first-level post data from the r/datascience subreddit. We then manually annotated the posts to explore the taxonomy of trends and themes discussed by practitioners in the Reddit data science community, and augmented the manually annotated data using a BERT model with topic modeling. The key discussion themes, in order of frequency, were: Education, Jobs, Methods (of data science), Hardware and data collection, Data visualization, and Quality. The Quality theme includes discussions on bias, transparency, and fairness. A key finding was that there were very few discussions on data science project quality, particularly on minimizing the risk of machine learning bias. As discussions on bias are not yet common, data science teams should proactively identify and address potential questions and concerns that might arise in data science projects, especially the need to increase the team’s focus on potential bias and fairness.
Date of Conference: 17-20 December 2022
Date Added to IEEE Xplore: 26 January 2023