Influence of Stop-Words Removal on Sequence Patterns Identification within Comparable Corpora

  • Conference paper
Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 231)

Abstract

Short texts such as advertisements are characterised by a number of slogans, phrases, words, symbols, etc. To improve the quality of textual data, noise must be filtered out from the important data. The aim of this work is to determine to what extent time-consuming data pre-processing is necessary when discovering sequential patterns in English and Slovak advertisement corpora. For this purpose, an experiment focusing on data pre-processing was conducted on these two comparable corpora. We examine to what extent removing stop words influences the quantity and quality of the extracted rules. Stop-word removal had no impact on the quantity or quality of the extracted rules in either the English or the Slovak advertisement corpus. Only the language had a significant impact on the quantity and quality of the extracted rules.
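The kind of comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the stop list, the two example advertisements, and the helper names `remove_stop_words` and `sequence_support` are all hypothetical, and "sequence pattern" is simplified here to a contiguous pair of tokens counted once per text.

```python
from collections import Counter

# Hypothetical, tiny stop list for illustration only; not the list used in the paper.
STOP_WORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens with stop words filtered out, order preserved."""
    return [t for t in tokens if t.lower() not in stop_words]


def sequence_support(sequences, length=2):
    """Count in how many sequences each contiguous run of `length` tokens occurs."""
    support = Counter()
    for seq in sequences:
        # A set ensures each pattern is counted at most once per sequence.
        patterns = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        support.update(patterns)
    return support


# Two made-up advertisement texts (assumed data, not taken from the corpora).
ads = [
    "the best price for a new car".split(),
    "a new car for the best price".split(),
]

raw = sequence_support(ads)
filtered = sequence_support([remove_stop_words(ad) for ad in ads])

print(len(raw), "distinct patterns without stop-word removal")
print(len(filtered), "distinct patterns with stop-word removal")
```

Comparing `raw` and `filtered` shows how the number of candidate patterns and their support can be inspected before and after the pre-processing step. Whether the difference matters for real corpora is exactly the empirical question the paper addresses.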

Author information

Corresponding author

Correspondence to Daša Munková.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Munková, D., Munk, M., Vozár, M. (2014). Influence of Stop-Words Removal on Sequence Patterns Identification within Comparable Corpora. In: Trajkovik, V., Anastas, M. (eds) ICT Innovations 2013. Advances in Intelligent Systems and Computing, vol 231. Springer, Heidelberg. https://doi.org/10.1007/978-3-319-01466-1_6

  • DOI: https://doi.org/10.1007/978-3-319-01466-1_6

  • Publisher Name: Springer, Heidelberg

  • Print ISBN: 978-3-319-01465-4

  • Online ISBN: 978-3-319-01466-1

  • eBook Packages: Engineering, Engineering (R0)
