DOI: 10.1145/3357729.3357740

Comparison of Text Mining Feature Extraction Methods Using Moderated vs Non-Moderated Blogs: An Autism Perspective

Published: 20 November 2019

Abstract

Social scientists widely use online social media to study human behavior. Researchers have explored different feature extraction (FE) and classification techniques to perform sentiment analysis, topic identification, and related tasks. Most studies, however, evaluate FE and classification methods on only one class of dataset: well-defined data with little or no noise, or with well-defined noise. When the datasets under study have different noise characteristics, various FE and/or classification methods may fail to identify a given topic. In this paper, we fill this gap by quantitatively comparing multiple FE methods and classifiers on three datasets related to Autism Spectrum Disorder (ASD): two from moderator-controlled blogs and one from single-authored personal blogs. Our results show that no particular combination of FE method and classifier is best overall, but choosing the right combination can improve accuracy by over 30%.
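
The kind of comparison the abstract describes, scoring every pairing of an FE method with a classifier on the same data, can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy posts, the two FE methods (raw term counts and TF-IDF), and the two classifiers (nearest centroid and 1-nearest-neighbour, both using cosine similarity) are hypothetical stand-ins chosen for brevity, not the datasets, features, or classifiers evaluated in the paper.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy posts; NOT data from the paper.
TRAIN = [
    ("therapy session helped my son with speech", "therapy"),
    ("speech therapy appointment went well today", "therapy"),
    ("new occupational therapy routine at home", "therapy"),
    ("school meeting about the education plan", "school"),
    ("teacher updated the classroom education plan", "school"),
    ("first day at the new school went fine", "school"),
]
TEST = [
    ("our therapy appointment is tomorrow", "therapy"),
    ("the teacher called about school", "school"),
]


def tokenize(text):
    return text.lower().split()


def fit_tf(train_docs):
    """FE method 1: raw term counts over the training vocabulary."""
    vocab = {t for doc in train_docs for t in tokenize(doc)}

    def transform(doc):
        return {t: float(c) for t, c in Counter(tokenize(doc)).items() if t in vocab}

    return transform


def fit_tfidf(train_docs):
    """FE method 2: term counts weighted by a smoothed inverse document frequency."""
    n = len(train_docs)
    df = Counter(t for doc in train_docs for t in set(tokenize(doc)))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

    def transform(doc):
        return {t: c * idf[t] for t, c in Counter(tokenize(doc)).items() if t in idf}

    return transform


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fit_centroid(vectors, labels):
    """Classifier 1: pick the class whose summed vector is most similar.

    Cosine similarity ignores scale, so the per-class sum behaves like the mean.
    """
    sums = {}
    for vec, lab in zip(vectors, labels):
        acc = sums.setdefault(lab, Counter())
        for t, w in vec.items():
            acc[t] += w
    return lambda v: max(sorted(sums), key=lambda lab: cosine(v, sums[lab]))


def fit_1nn(vectors, labels):
    """Classifier 2: copy the label of the most similar training document."""
    pairs = list(zip(vectors, labels))
    return lambda v: max(pairs, key=lambda p: cosine(v, p[0]))[1]


FE = {"tf": fit_tf, "tfidf": fit_tfidf}
CLF = {"centroid": fit_centroid, "1nn": fit_1nn}


def compare(train, test):
    """Return held-out accuracy for every (FE method, classifier) pairing."""
    train_docs, train_labels = zip(*train)
    results = {}
    for (fe_name, fit_fe), (clf_name, fit_clf) in product(FE.items(), CLF.items()):
        transform = fit_fe(train_docs)
        predict = fit_clf([transform(d) for d in train_docs], train_labels)
        hits = sum(predict(transform(d)) == y for d, y in test)
        results[(fe_name, clf_name)] = hits / len(test)
    return results


if __name__ == "__main__":
    for combo, acc in sorted(compare(TRAIN, TEST).items()):
        print(combo, f"accuracy={acc:.2f}")
```

Running `compare` once per dataset and placing the resulting tables side by side is one way to see whether the best-scoring pairing changes with the noise characteristics of the data. A study at realistic scale would more likely rely on an established machine learning library and cross-validation than on hand-written components like these.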

Published In

DPH2019: Proceedings of the 9th International Conference on Digital Public Health
November 2019
147 pages
ISBN:9781450372084
DOI:10.1145/3357729
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. asd
  2. autism
  3. blogs
  4. classifier
  5. feature evaluation

Qualifiers

  • Research-article

Funding Sources

  • US NSF

Conference

DPH2019
Sponsor:
  • SIGKDD
  • University College London
