Abstract
XML-based big bibliographic datasets are now common in many domains, where they provide metadata about the articles published in that domain. Well-defined tags give each article's year, title, authors, abstract, keywords, article type, publication venue and other specific details. Many statistics can be extracted from such a dataset. In most cases, however, the dataset has no tag for the domain sub-topic of an article, because the sub-topic is not an article attribute. Statistics at the sub-topic level therefore require each article to be mapped to its associated sub-topic first. This paper investigates this problem and proposes a fast approach to finding trending articles and hot topics in XML-based big bibliographic datasets. The proposed framework first uses a domain ontology to classify articles into their sub-topics. Hot topics, trending keywords and trending articles are then detected quickly using novel MapReduce algorithms implemented on a Hadoop distributed framework. A performance comparison shows that the approach outperforms its non-MapReduce counterpart in quickly picking out the trending keywords and titles within a particular hot topic of an XML-based bibliographic dataset.
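The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a DBLP-like layout with `article`, `title` and `year` tags, a hypothetical toy ontology (`ONTOLOGY`) that maps each sub-topic to a few indicative terms, and a single-process simulation of the map and reduce phases in place of a real Hadoop job. All names and sample records are illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical toy ontology: sub-topic -> indicative terms (an assumption,
# not taken from the paper).
ONTOLOGY = {
    "data mining": {"clustering", "mining", "association"},
    "distributed systems": {"mapreduce", "hadoop", "parallel"},
}

# Invented sample records in an assumed DBLP-like layout.
SAMPLE_XML = """<dblp>
  <article><title>Parallel XML parsing with MapReduce</title><year>2015</year></article>
  <article><title>Frequent itemset mining for text clustering</title><year>2015</year></article>
  <article><title>Hadoop MapReduce scheduling</title><year>2015</year></article>
</dblp>"""


def classify(title):
    """Return the sub-topic whose terms overlap the title most, or None."""
    words = set(title.lower().split())
    best = max(ONTOLOGY, key=lambda topic: len(ONTOLOGY[topic] & words))
    return best if ONTOLOGY[best] & words else None


def map_article(article):
    """Map phase: emit ((topic, key), 1) pairs for one article element."""
    title = article.findtext("title", default="")
    topic = classify(title)
    if topic is None:
        return
    yield (topic, "*"), 1  # "*" counts articles in the topic
    for word in title.lower().split():
        if word in ONTOLOGY[topic]:
            yield (topic, word), 1  # counts one keyword occurrence


def reduce_counts(pairs):
    """Reduce phase: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(xml_text):
    root = ET.fromstring(xml_text)
    pairs = (pair for art in root.iter("article") for pair in map_article(art))
    return reduce_counts(pairs)


counts = run(SAMPLE_XML)
```

In this sketch, the topic with the largest `"*"` count plays the role of the hot topic, and the keywords with the highest counts under that topic are its trending keywords. A distributed version would run `map_article` on separate splits of the XML file and let the framework group the pairs by key before the reduce step.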
Cite this article
Swaraj, K.P., Manjula, D. A fast approach to identify trending articles in hot topics from XML based big bibliographic datasets. Cluster Comput 19, 837–848 (2016). https://doi.org/10.1007/s10586-016-0561-1
DOI: https://doi.org/10.1007/s10586-016-0561-1