DOI: 10.1145/2095536.2095548
research-article

A web content mining approach for tag cloud generation

Published: 05 December 2011

ABSTRACT

Tag clouds, also known as word clouds, are very useful for quickly perceiving the most prominent terms embedded within a text collection and for determining their relative prominence. The effectiveness of a tag cloud in conceptualizing a text corpus is directly proportional to the quality of the keyphrases extracted from the corpus. Although authors of scientific publications provide a list of about five to ten keywords that are used to map the publications into their respective domains, the exponential growth of non-scientific documents on the World Wide Web calls for an automatic mechanism to identify the keyphrases embedded within them for tag cloud generation. In this paper, we propose a web content mining technique to extract keyphrases from web documents for tag cloud generation. Instead of using partial or full parsing, the proposed method applies an n-gram technique followed by various heuristics-based refinements to identify a set of lexical and semantic features from text documents. We propose a rich set of domain-independent features to model candidate keyphrases and establish their keyphraseness using classification models. We also propose a font-determination function to determine the relative font size of keyphrases for tag cloud generation. Experimental results establish the efficacy of the proposed method, which outperforms the popular keyphrase extraction system KEA.
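
As a rough illustration of the pipeline outlined in the abstract, the sketch below generates n-gram candidates without parsing and maps a score to a relative font size. It is not the authors' method: the abstract does not specify the heuristics, the features, or the font-determination function, so the stopword filter, the use of raw frequency as a score, the linear scaling, and the names `candidate_ngrams` and `font_size` are all assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "for", "to", "in", "and", "is", "are", "on", "with"}
)


def candidate_ngrams(text, max_n=3):
    """Count word n-grams (1..max_n) that neither start nor end with a stopword."""
    counts = Counter()
    # Split on punctuation so that n-grams never cross phrase boundaries.
    for segment in re.split(r"[.,;:!?()\n]+", text.lower()):
        words = re.findall(r"[a-z][a-z\-]*", segment)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Simple heuristic refinement: drop stopword-bounded candidates.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


def font_size(score, min_score, max_score, min_pt=10, max_pt=36):
    """Linearly map a score onto a font size between min_pt and max_pt."""
    if max_score == min_score:
        # All candidates score the same, so use the midpoint size.
        return (min_pt + max_pt) // 2
    ratio = (score - min_score) / (max_score - min_score)
    return round(min_pt + ratio * (max_pt - min_pt))


if __name__ == "__main__":
    sample = "Tag clouds summarize text. Tag clouds show keyphrases."
    counts = candidate_ngrams(sample)
    lo, hi = min(counts.values()), max(counts.values())
    for phrase, freq in counts.most_common(5):
        print(phrase, freq, font_size(freq, lo, hi))
```

In a fuller system, the raw frequency used here would be replaced by the keyphraseness score produced by a trained classifier, and only candidates accepted by that classifier would be sized and drawn.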

    • Published in

      iiWAS '11: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services
      December 2011
      572 pages
      ISBN:9781450307840
      DOI:10.1145/2095536

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States
