DIKEA: Domain-Independent Keyphrase Extraction Algorithm

  • Conference paper
AI 2012: Advances in Artificial Intelligence (AI 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7691)

Abstract

This paper introduces DIKEA, a new domain-independent keyphrase extraction system. Keyphrase extraction, the task of automatically extracting or assigning keyphrases to documents, is a challenging problem, and solving it can benefit many research areas such as information retrieval, particularly indexing, clustering, and summarization. The landmark system KEA (Keyphrase Extraction Algorithm) formulated the task as a supervised machine learning problem and successfully applied a Naïve Bayes model to it; the approach showed great promise, but its performance is not satisfactory. Its state-of-the-art extension, KEA++, performs significantly better but relies on a domain-specific vocabulary, which is often unavailable or incomplete. This paper presents a novel domain-independent approach with three main contributions: utilising the largest online knowledge source, Wikipedia, for keyphrase candidate selection; presenting new features for keyphrase evaluation, including a Wikipedia-based feature, link probability; and evaluating a number of different learning algorithms, including multilayer perceptrons, for keyphrase selection. Experiments show that our system clearly outperforms KEA and closely matches the performance of KEA++, without requiring any domain-specific knowledge such as KEA++’s vocabulary list.
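
The abstract names link probability as a Wikipedia-based feature without defining it. As a rough illustration only, the Python sketch below uses a definition common in Wikipedia-mining work: the fraction of articles containing a phrase in which that phrase appears as link anchor text. Whether DIKEA computes the feature exactly this way is an assumption, and the `PhraseCounts` type, the function names, and all counts are hypothetical values invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhraseCounts:
    """Hypothetical per-phrase counts, e.g. gathered from a Wikipedia dump."""
    articles_with_link: int    # articles where the phrase is used as link anchor text
    articles_with_phrase: int  # articles where the phrase occurs at all


def link_probability(counts: PhraseCounts) -> float:
    """Fraction of articles mentioning the phrase in which it is a link.

    Assumed definition; returns 0.0 for phrases never seen in the corpus.
    """
    if counts.articles_with_phrase == 0:
        return 0.0
    return counts.articles_with_link / counts.articles_with_phrase


def rank_candidates(stats: dict[str, PhraseCounts]) -> list[tuple[str, float]]:
    """Score every candidate phrase and sort by descending link probability."""
    scored = [(phrase, link_probability(c)) for phrase, c in stats.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up counts, chosen only to show the behaviour of the feature.
toy_stats = {
    "machine learning": PhraseCounts(articles_with_link=800, articles_with_phrase=1000),
    "this paper": PhraseCounts(articles_with_link=1, articles_with_phrase=5000),
    "naive bayes": PhraseCounts(articles_with_link=300, articles_with_phrase=400),
}

ranking = rank_candidates(toy_stats)
for phrase, score in ranking:
    print(f"{phrase}: {score:.4f}")
```

In a supervised setting like the one the abstract describes, a score of this kind would be one input feature among several passed to the learning algorithm, not a ranking criterion on its own.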

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, D.X., Gao, X., Andreae, P. (2012). DIKEA: Domain-Independent Keyphrase Extraction Algorithm. In: Thielscher, M., Zhang, D. (eds) AI 2012: Advances in Artificial Intelligence. AI 2012. Lecture Notes in Computer Science, vol 7691. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35101-3_61

  • DOI: https://doi.org/10.1007/978-3-642-35101-3_61

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35100-6

  • Online ISBN: 978-3-642-35101-3

  • eBook Packages: Computer Science, Computer Science (R0)
