
Efficiently mining frequent itemsets applied for textual aggregation

Applied Intelligence

Abstract

Text mining approaches are commonly used to discover relevant information and relationships in large volumes of text data. The term data mining refers to methods for analyzing data with the objective of finding patterns that aggregate the main properties of the data. Combining data mining approaches with on-line analytical processing (OLAP) tools allows us to refine the techniques used in textual aggregation. In this paper, we propose a novel aggregation function for textual data based on the discovery of frequent closed patterns in a generated document/keyword matrix. Our contribution is to use a data mining technique, namely a closed pattern mining algorithm, to aggregate keywords. An experimental study on a real corpus of more than 700 scientific papers collected from Microsoft Academic Search shows that the proposed algorithm largely outperforms four state-of-the-art textual aggregation methods in terms of recall, precision, F-measure and runtime.
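
To make the notions in the abstract concrete, the following is a minimal Python sketch, not the algorithm proposed in the paper. It assumes each document is represented as a set of keywords (one row of a document/keyword matrix) and enumerates frequent closed keyword sets by brute force. An itemset is closed when no proper superset has the same support. The function names `closed_frequent_itemsets` and `precision_recall_f`, the toy documents, and the threshold are all illustrative assumptions.

```python
from itertools import combinations


def closed_frequent_itemsets(docs, min_support):
    """Return {keyword set: support} for every frequent closed keyword set.

    docs is a list of keyword sets, one per document (hypothetical input
    format). Brute-force enumeration, suitable only for tiny examples.
    """
    keywords = sorted(set().union(*docs))
    frequent = {}
    for size in range(1, len(keywords) + 1):
        found_any = False
        for combo in combinations(keywords, size):
            items = frozenset(combo)
            support = sum(1 for doc in docs if items <= doc)
            if support >= min_support:
                frequent[items] = support
                found_any = True
        # Support is anti-monotone: if no set of this size is frequent,
        # no larger set can be frequent either.
        if not found_any:
            break
    # Keep only sets with no proper superset of equal support. Such a
    # superset would itself be frequent, so checking `frequent` suffices.
    return {
        items: support
        for items, support in frequent.items()
        if not any(items < other and support == frequent[other]
                   for other in frequent)
    }


def precision_recall_f(predicted, relevant):
    """Set-based precision, recall and F-measure (harmonic mean)."""
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy corpus: four documents described by keyword sets (made-up data).
toy_docs = [
    {"olap", "text", "mining"},
    {"olap", "text"},
    {"olap", "mining"},
    {"text", "mining", "olap"},
]
for items, support in closed_frequent_itemsets(toy_docs, min_support=2).items():
    print(sorted(items), support)
```

In this toy corpus the set {text} is frequent but not closed, because {olap, text} occurs in exactly the same documents. Reporting only closed sets therefore gives a more compact list of candidate aggregate keywords without losing support information.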


Notes

  1. http://www.it-innovations.ae

  2. academic.research.microsoft.com/


Author information


Correspondence to Mustapha Bouakkaz.


About this article


Cite this article

Bouakkaz, M., Ouinten, Y., Loudcher, S. et al. Efficiently mining frequent itemsets applied for textual aggregation. Appl Intell 48, 1013–1019 (2018). https://doi.org/10.1007/s10489-017-1050-9


  • DOI: https://doi.org/10.1007/s10489-017-1050-9
