Abstract
This research addresses a well-known problem in text mining: the high computational complexity caused by many irrelevant features (terms, words), which act as noise from the classification point of view and make the time and memory requirements grow non-linearly. Using real-world textual documents expressing sentiment, collected from three selected, extensively tracked Internet sources and freely written in English, a group of available algorithms for discovering relevant features (Gain Ratio, Chi Square, Info Gain, Symmetrical Uncertainty, Winnow, One R, Relief F, Principal Components, SVM, LSA) was tested with 10,000, 25,000, and 50,000 social-network entries. All the algorithms gave very similar results when identifying the relevant features; typically, only the significance rank of the features differed slightly. Except for some slower algorithms, the term-preselection time ranged from seconds through minutes to a couple of hours. After using only the relevant fraction of features instead of all of them, however, the entry length decreased by several orders of magnitude, particularly for the larger data sets with very high dimensionality. Despite this extremely strong reduction in the number of words, the classification accuracy remained the same regardless of which relevant-feature selection algorithm was chosen.
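As an illustration of the kind of term preselection described above, the following minimal sketch ranks terms with a chi-square score computed from a 2x2 presence/class table and keeps only the top-ranked ones. It is not the authors' implementation or tooling: the four toy reviews, the labels, and the helper names `chi_square_scores` and `select_top_terms` are hypothetical and chosen only for the example.

```python
from collections import Counter


def chi_square_scores(docs, labels, positive):
    """Chi-square score per term from its 2x2 table (term present/absent x class)."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos
    in_pos, in_neg = Counter(), Counter()
    for text, y in zip(docs, labels):
        terms = set(text.lower().split())  # binary presence, not term frequency
        (in_pos if y == positive else in_neg).update(terms)

    scores = {}
    for term in set(in_pos) | set(in_neg):
        a, b = in_pos[term], in_neg[term]  # documents containing the term
        c, d = n_pos - a, n_neg - b        # documents lacking the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A term present in every document carries no information: score 0.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_terms(scores, k):
    """Return the k best-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy data, not taken from the study.
docs = [
    "great hotel and friendly staff",
    "great location and clean room",
    "terrible hotel and rude staff",
    "terrible room and noisy location",
]
labels = ["pos", "pos", "neg", "neg"]

scores = chi_square_scores(docs, labels, positive="pos")
top = select_top_terms(scores, 2)
# Each entry shrinks to the selected terms only.
reduced = [[w for w in d.split() if w in top] for d in docs]
```

On this toy corpus the two class-specific words receive the highest score, while words that occur equally often in both classes score zero, so each entry collapses to a single term. Real collections would need proper tokenization and a much larger `k`.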
Acknowledgments
This research was funded by the Czech Science Foundation, grant No. 16-26353S “Sentiment and its Impact on Stock Markets”.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Žižka, J., Dařena, F. (2017). The Comparison of Effects of Relevant-Feature Selection Algorithms on Certain Social-Network Text-Mining Viewpoints. In: Silhavy, R., Senkerik, R., Kominkova Oplatkova, Z., Prokopova, Z., Silhavy, P. (eds) Artificial Intelligence Trends in Intelligent Systems. CSOC 2017. Advances in Intelligent Systems and Computing, vol 573. Springer, Cham. https://doi.org/10.1007/978-3-319-57261-1_35
DOI: https://doi.org/10.1007/978-3-319-57261-1_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57260-4
Online ISBN: 978-3-319-57261-1
eBook Packages: Engineering, Engineering (R0)