Big Text advantages and challenges: classification perspective

Sokolova, Marina

doi:10.1007/s41060-017-0087-5

Trends of Data Science
Published: 21 December 2017

Volume 5, pages 1–10, (2018)
International Journal of Data Science and Analytics

Marina Sokolova ORCID: orcid.org/0000-0002-7719-9615¹

Abstract

Big Text, i.e., large repositories of textual data, is a part of Big Data. In total, 80–85% of Big Text comes in unstructured form, with a significant contribution from social media. In this position paper, we discuss the advantages and challenges of Big Text with respect to text classification. We propose a new approach to evaluating the performance of classification algorithms when they are applied to Big Text, namely, using corpora comparison in the evaluation of results. We also discuss a significant increase in texts with comprehensive information and the challenges that Big Text methods face in the analysis of such texts.

Acknowledgements

The author thanks anonymous reviewers for helpful comments.

Author information

Authors and Affiliations

School of Epidemiology and Public Health, 308E-600 Peter Morand Cres, Ottawa, ON, K1G 5Z3, Canada
Marina Sokolova

Corresponding author

Correspondence to Marina Sokolova.

Ethics declarations

Conflict of interest

The author states that there is no conflict of interest.

About this article

Cite this article

Sokolova, M. Big Text advantages and challenges: classification perspective. Int J Data Sci Anal 5, 1–10 (2018). https://doi.org/10.1007/s41060-017-0087-5

Received: 17 July 2017
Accepted: 11 December 2017
Published: 21 December 2017
Issue Date: February 2018
DOI: https://doi.org/10.1007/s41060-017-0087-5
