ABSTRACT
Tag clouds, also known as word clouds, are useful for quickly perceiving the most prominent terms in a text collection and their relative prominence. The effectiveness of a tag cloud in conceptualizing a text corpus depends directly on the quality of the keyphrases extracted from that corpus. Although authors of scientific publications typically provide a list of about five to ten keywords that map a paper to its domain, the exponential growth of non-scientific documents on the World Wide Web calls for an automatic mechanism to identify the keyphrases embedded within them for tag cloud generation. In this paper, we propose a web content mining technique to extract keyphrases from web documents for tag cloud generation. Instead of using partial or full parsing, the proposed method applies an n-gram technique followed by various heuristics-based refinements to identify a set of lexical and semantic features from text documents. We propose a rich set of domain-independent features to model candidate keyphrases and establish their keyphraseness using classification models. We also propose a font-determination function to determine the relative font size of keyphrases for tag cloud generation. The efficacy of the proposed method is established through experimentation, in which it outperforms the popular keyphrase extraction system KEA.
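The two mechanical steps named in the abstract, n-gram candidate generation and font-size determination, can be illustrated with a minimal Python sketch. It is not the method proposed in the paper: the stopword list, the rule that a candidate may not start or end with a stopword, the use of raw frequency as the score, and the linear scaling in `font_size` are all assumptions chosen for illustration, and the names `candidate_ngrams` and `font_size` are hypothetical. The paper's actual features, classifier, and font-determination function are not given in the abstract.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on",
             "for", "to", "is", "are", "with", "by"}


def candidate_ngrams(text, max_n=3):
    """Count word n-grams (n = 1..max_n) that neither start nor end with a stopword."""
    counts = Counter()
    # Split on punctuation so that no n-gram crosses a phrase boundary.
    for segment in re.split(r"[.,;:!?()\n]+", text.lower()):
        words = re.findall(r"[a-z][a-z\-]*", segment)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Assumed heuristic refinement: drop stopword-bounded candidates.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


def font_size(score, min_score, max_score, min_pt=10, max_pt=36):
    """Linearly map a score in [min_score, max_score] to a font size in points."""
    if max_score == min_score:
        # All candidates score the same, so use the midpoint size.
        return (min_pt + max_pt) / 2
    ratio = (score - min_score) / (max_score - min_score)
    return min_pt + ratio * (max_pt - min_pt)


if __name__ == "__main__":
    sample = "Tag clouds summarize text. Tag clouds rely on keyphrase extraction."
    counts = candidate_ngrams(sample)
    lo, hi = min(counts.values()), max(counts.values())
    for phrase, freq in counts.most_common(5):
        print(f"{phrase}: {font_size(freq, lo, hi):.0f}pt")
```

In a fuller pipeline the raw frequency would be replaced by the keyphraseness score produced by a trained classifier, with `font_size` applied to that score instead.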
Index Terms
- A web content mining approach for tag cloud generation