Abstract
Text mining approaches are commonly used to discover relevant information and relationships in large volumes of text data. Data mining refers to methods for analyzing data with the objective of finding patterns that summarize its main properties. Combining data mining approaches with on-line analytical processing (OLAP) tools makes it possible to refine the techniques used in textual aggregation. In this paper, we propose a novel aggregation function for textual data based on the discovery of frequent closed patterns in a generated document/keyword matrix. Our contribution is to use a data mining technique, namely a closed pattern mining algorithm, to aggregate keywords. An experimental study on a real corpus of more than 700 scientific papers collected from Microsoft Academic Search shows that the proposed algorithm largely outperforms four state-of-the-art textual aggregation methods in terms of recall, precision, F-measure and runtime.
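To make the idea of frequent closed patterns in a document/keyword matrix concrete, the following is a minimal sketch, not the algorithm evaluated in the paper. It assumes each document is represented by its set of keywords (one row of the matrix) and uses a simple intersection-based enumeration that is only practical for small examples. The function names `closed_frequent_patterns` and `precision_recall_f`, the toy corpus, and the choice to take the union of the mined patterns as the aggregated keyword set are all illustrative assumptions.

```python
def closed_frequent_patterns(docs, min_support):
    """Return {keyword_set: support} for closed keyword sets that occur
    in at least `min_support` documents.

    A keyword set is closed when no strict superset occurs in exactly the
    same documents. Every closed set is the intersection of some group of
    documents, so we enumerate those intersections incrementally.
    (Illustrative brute-force approach; assumption, not the paper's method.)
    """
    docs = [frozenset(d) for d in docs if d]
    candidates = set()
    for d in docs:
        new = {d}
        for c in candidates:
            inter = c & d
            if inter:  # ignore empty intersections
                new.add(inter)
        candidates |= new

    result = {}
    for c in candidates:
        support = sum(1 for d in docs if c <= d)
        if support >= min_support:
            result[c] = support
    return result


def precision_recall_f(selected, relevant):
    """Precision, recall and F-measure of a selected keyword set
    against a reference set of relevant keywords."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Hypothetical toy corpus: one keyword set per document.
corpus = [
    {"olap", "text", "mining"},
    {"olap", "text"},
    {"olap", "text", "cube"},
    {"mining", "pattern"},
]

patterns = closed_frequent_patterns(corpus, min_support=2)
# Assumption: aggregate by taking the union of all mined patterns.
aggregated = set().union(*patterns) if patterns else set()
print(patterns)
print(aggregated)
print(precision_recall_f(aggregated, {"olap", "text", "cube"}))
```

On this toy corpus, `{"olap"}` alone is frequent but not closed, because it always co-occurs with `"text"`; only the larger set `{"olap", "text"}` is reported. Keeping only closed sets in this way is what makes the mined patterns a compact summary of the keywords.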
Cite this article
Bouakkaz, M., Ouinten, Y., Loudcher, S. et al. Efficiently mining frequent itemsets applied for textual aggregation. Appl Intell 48, 1013–1019 (2018). https://doi.org/10.1007/s10489-017-1050-9
DOI: https://doi.org/10.1007/s10489-017-1050-9