research-article

Comparison of Text Mining Feature Extraction Methods Using Moderated vs Non-Moderated Blogs: An Autism Perspective

Authors: Abu Saleh Md Tayeen, Saleem Masadeh, Abderrahmen Mtibaa, Satyajayant Misra, Moumita Choudhury

DPH2019: Proceedings of the 9th International Conference on Digital Public Health

Pages 69 - 78

https://doi.org/10.1145/3357729.3357740

Published: 20 November 2019

Abstract

Social scientists widely use online social media to study human behavior. Researchers have explored different feature extraction (FE) and classification techniques to perform tasks such as sentiment analysis and topic identification. Most studies, however, evaluate FE and classification methods on only one particular class of datasets: well-defined datasets with little or no noise, or with well-defined noise. When the datasets under study have different noise characteristics, various FE and/or classification methods may fail to identify a given topic. In this paper, we fill this gap by quantitatively comparing multiple FE methods and classifiers using three different datasets related to Autism Spectrum Disorder (ASD): two from moderator-controlled blogs and one from single-authored personal blogs. Our results show that no particular combination of FE method and classifier is the best overall, but choosing the right ones can improve accuracy by over 30%.
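
As a rough illustration of the kind of comparison the abstract describes, the following minimal sketch crosses two FE methods with two classifiers on two small datasets and reports leave-one-out accuracy for each combination. It is not the authors' pipeline: the abstract does not name the methods it evaluates, so the choice of term-frequency and TF-IDF features, the nearest-neighbor and nearest-centroid classifiers, the toy documents, the labels, and every function name here are assumptions made only for illustration.

```python
import math
from collections import Counter

def tokens(doc):
    return doc.lower().split()

def tf_features(train_docs, docs):
    # Bag-of-words: raw term counts (the training set is not needed).
    return [Counter(tokens(d)) for d in docs]

def tfidf_features(train_docs, docs):
    # TF-IDF with a smoothed IDF estimated from the training documents only.
    n = len(train_docs)
    df = Counter(w for d in train_docs for w in set(tokens(d)))
    return [
        {w: tf * (math.log((1 + n) / (1 + df[w])) + 1.0)
         for w, tf in Counter(tokens(d)).items()}
        for d in docs
    ]

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbor(train_x, train_y, x):
    # Label of the single most similar training document.
    best = max(range(len(train_x)), key=lambda i: cosine(train_x[i], x))
    return train_y[best]

def nearest_centroid(train_x, train_y, x):
    # Label whose summed class vector is most similar to x
    # (cosine similarity is scale-invariant, so sum and mean are equivalent).
    centroids = {}
    for vec, label in zip(train_x, train_y):
        c = centroids.setdefault(label, Counter())
        for w, v in vec.items():
            c[w] += v
    return max(sorted(centroids), key=lambda lab: cosine(centroids[lab], x))

def loo_accuracy(docs, labels, extract, classify):
    # Leave-one-out: hold out each document once, train on the rest.
    correct = 0
    for i in range(len(docs)):
        train_docs = docs[:i] + docs[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        train_x = extract(train_docs, train_docs)
        test_x = extract(train_docs, [docs[i]])[0]
        correct += classify(train_x, train_y, test_x) == labels[i]
    return correct / len(docs)

def compare(datasets):
    extractors = {"tf": tf_features, "tfidf": tfidf_features}
    classifiers = {"1nn": nearest_neighbor, "centroid": nearest_centroid}
    return {
        (name, fe, clf): loo_accuracy(docs, labels, extractors[fe], classifiers[clf])
        for name, (docs, labels) in datasets.items()
        for fe in extractors
        for clf in classifiers
    }

# Invented toy data: a "clean" set and a noisier, chattier one.
TOY_DATASETS = {
    "moderated": (
        ["speech therapy session went well", "new speech therapy plan",
         "therapy helps with speech delay", "school meeting about support plan",
         "teacher gave school support", "school classroom aide assigned"],
        ["therapy", "therapy", "therapy", "school", "school", "school"],
    ),
    "personal": (
        ["lol long day but therapy ok i guess", "ugh traffic then speech therapy",
         "coffee first then therapy again lol", "ugh school called again today",
         "lol school bus late coffee needed", "long day school meeting i guess"],
        ["therapy", "therapy", "therapy", "school", "school", "school"],
    ),
}

results = compare(TOY_DATASETS)
for key, acc in sorted(results.items()):
    print(key, round(acc, 2))
```

Keeping the extractors and classifiers in separate dictionaries means each new method adds one entry rather than a new code path, which is what makes an exhaustive FE-by-classifier comparison per dataset practical.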

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
2. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing theory, concepts and paradigms
      1. Social media

Published In

DPH2019: Proceedings of the 9th International Conference on Digital Public Health

November 2019

147 pages

ISBN:9781450372084

DOI:10.1145/3357729

Program Chairs:
Patty Kostkova (University College London, UK)
Caroline Wood (University College London, UK)
Arnold Bosman (Transmissible, Netherlands)
Floriana Grasso (University of Liverpool, UK)
Michael Edelstein (Public Health England, UK)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
University College London: University College London

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

US NSF

Conference

DPH2019

DPH2019: 9th International Digital Public Health Conference (2019)

November 20 - 23, 2019

Marseille, France
