Exploratory Class-Imbalanced and Non-identical Data Distribution in Automatic Keyphrase Extraction

  • Conference paper
Advances in Neural Networks – ISNN 2012 (ISNN 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7368)

Abstract

While supervised learning algorithms hold much promise for automatic keyphrase extraction, most of them presume that the samples are evenly distributed among different classes as well as drawn from an identical distribution, which, however, may not be the case in the real-world task of extracting keyphrases from documents. In this paper, we propose a novel supervised keyphrase extraction approach which deals with the problems of class-imbalanced and non-identical data distributions in automatic keyphrase extraction. Our approach is by nature a stacking approach where meta-models are trained on balanced partitions of a given training set and then combined through introducing meta-features describing particular keyphrase patterns embedded in each document. Experimental results verify the effectiveness of our approach.
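
The balanced-partition idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the two toy features per candidate phrase, the nearest-centroid base learner, and the unweighted averaging used as the combiner. The abstract says the models are combined through document-level meta-features but does not give the details, so plain averaging stands in for that step.

```python
import random
from statistics import mean


def balanced_partitions(samples, labels, seed=0):
    """Pair every minority (label 1) sample with successive chunks of the
    majority (label 0) class, so each partition is roughly balanced.
    The last partition may hold fewer negatives than positives."""
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    rng.shuffle(neg)
    parts = []
    for start in range(0, len(neg), len(pos)):
        chunk = neg[start:start + len(pos)]
        parts.append((pos + chunk, [1] * len(pos) + [0] * len(chunk)))
    return parts


def train_centroid_model(xs, ys):
    """Toy base learner (an assumption, not the paper's model): returns a
    scorer in [0, 1] that is higher when x is nearer the positive centroid."""
    def centroid(rows):
        return [mean(col) for col in zip(*rows)]

    c_pos = centroid([x for x, y in zip(xs, ys) if y == 1])
    c_neg = centroid([x for x, y in zip(xs, ys) if y == 0])

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg))
        total = d_pos + d_neg
        return d_neg / total if total > 0 else 0.5

    return score


def ensemble_score(models, x):
    """Placeholder combiner: unweighted average of the base-model scores."""
    return mean(m(x) for m in models)


# Hypothetical candidate phrases: (frequency-like score, relative position).
samples = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.9), (0.1, 0.8),
           (0.3, 0.7), (0.2, 0.6), (0.1, 0.9), (0.3, 0.8)]
labels = [1, 1, 0, 0, 0, 0, 0, 0]  # 2 keyphrases, 6 non-keyphrases

parts = balanced_partitions(samples, labels)
models = [train_centroid_model(xs, ys) for xs, ys in parts]
print(len(parts), ensemble_score(models, (0.85, 0.15)))
```

With two positives and six negatives, the toy data yields three partitions of two positives and two negatives each, and one base model is trained per partition. Replacing `ensemble_score` with a combiner whose weights depend on per-document meta-features would bring the sketch closer to what the abstract describes.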

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ni, W., Liu, T., Zeng, Q. (2012). Exploratory Class-Imbalanced and Non-identical Data Distribution in Automatic Keyphrase Extraction. In: Wang, J., Yen, G.G., Polycarpou, M.M. (eds) Advances in Neural Networks – ISNN 2012. ISNN 2012. Lecture Notes in Computer Science, vol 7368. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31362-2_38

  • DOI: https://doi.org/10.1007/978-3-642-31362-2_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31361-5

  • Online ISBN: 978-3-642-31362-2

  • eBook Packages: Computer Science (R0)
