Abstract
While supervised learning algorithms hold much promise for automatic keyphrase extraction, most of them presume that the samples are evenly distributed among different classes as well as drawn from an identical distribution, which, however, may not be the case in the real-world task of extracting keyphrases from documents. In this paper, we propose a novel supervised keyphrase extraction approach which deals with the problems of class-imbalanced and non-identical data distributions in automatic keyphrase extraction. Our approach is by nature a stacking approach where meta-models are trained on balanced partitions of a given training set and then combined through introducing meta-features describing particular keyphrase patterns embedded in each document. Experimental results verify the effectiveness of our approach.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Lehtonen, M., Doucet, A.: Enhancing Keyword Search with a Keyphrase Index. In: Geva, S., Kamps, J., Trotman, A. (eds.) INEX 2008. LNCS, vol. 5631, pp. 65–70. Springer, Heidelberg (2009)
Wu, X., Bolivar, A.: Keyword extraction for contextual advertisement. In: Proceedings of the 17th WWW, pp. 1195–1196 (2008)
Witten, I.H., Paynter, G.W., Frank, E.: KEA: Practical Automatic Keyphrase Extraction. In: Proceedings of the 4th JCDL, pp. 254–255 (1999)
Turney, P.D.: Learning Algorithms for Keyphrase Extraction. Information Retrieval 2, 303–336 (2000)
He, H., Garcia, E.A.: Learning from Imbalanced Data. IEEE TKDE 21(9), 1263–1284 (2009)
Weiss, G.M., Provost, F.: The Effect of Class Distribution on Classifier Learning: An Empirical Study. Technical Report, Department of Computer Science, Rutgers University (2001)
Nguyen, T.D., Kan, M.-Y.: Keyphrase Extraction in Scientific Publications. In: Goh, D.H.-L., Cao, T.H., Sølvberg, I.T., Rasmussen, E. (eds.) ICADL 2007. LNCS, vol. 4822, pp. 317–326. Springer, Heidelberg (2007)
Li, Z., Zhou, D., Juan, Y., Han, J.: Keyword Extraction for Social Snippets. In: Proceedings of the 19th WWW, pp. 1143–1144 (2010)
Yih, W., Goodman, J., Carvalho, V.R.: Finding Advertising Keywords on Web Pages. In: Proceedings of the 15th WWW, pp. 213–222 (2006)
Mihalcea, R., Tarau, P.: TextRank: Bringing Order into Texts. In: Proceedings of the 1st EMNLP, pp. 404–411 (2004)
Litvak, M., Last, M.: Graph-Based Keyword Extraction for Single-Document Summarization. In: Proceedings of the Workshop on Multi-source Multilingual Information Extraction and Summarization, pp. 17–24 (2008)
Wan, X., Xiao, J.: CollabRank: Towards a Collaborative Approach to Single-Document Keyphrase Extraction. In: Proceedings of the 22nd CICLing, pp. 969–976 (2008)
Liu, Z., Huang, W., Zheng, Y., Sun, M.: Automatic Keyphrase Extraction via Topic Decomposition. In: Proceedings of the 7th EMNLP, pp. 366–376 (2010)
Liu, X., Wu, J., Zhou, Z.: Exploratory Under-Sampling for Class-Imbalance Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B 39, 539–550 (2009)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research 6, 321–357 (2002)
Sun, Y., Kamel, M.S., Wong, A.K.C., Wang, Y.: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40, 3358–3378 (2007)
Zhou, Z., Liu, X.: Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Transactions on Knowledge and Data Engineering 18, 63–77 (2006)
Breiman, L.: Stacked regressions. Machine Learning 24, 49–64 (1999)
Dzeroski, S., Zenko, B.: Is Combining Classifiers with Stacking Better than Selecting the Best One? Machine Learning 54, 255–273 (2004)
Xu, L., Jiang, J., Zhou, Y., Wu, H., Shen, G., Yu, R.: MCCV stacked regression for model combination and fast spectral interval selection in multivariate calibration. Chemometrics and Intelligent Laboratory Systems 87, 226–230 (2007)
Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D., Stamatopoulos, P.: Stacking classifiers for anti-spam filtering of E-mail. In: Proceedings of the 6th EMNLP, pp. 44–50 (2001)
Sill, J., Takacs, G., Mackey, L., Lin, D.: Feature-Weighted Linear Stacking. arXiv:0911.0460 (2009)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ni, W., Liu, T., Zeng, Q. (2012). Exploratory Class-Imbalanced and Non-identical Data Distribution in Automatic Keyphrase Extraction. In: Wang, J., Yen, G.G., Polycarpou, M.M. (eds) Advances in Neural Networks – ISNN 2012. ISNN 2012. Lecture Notes in Computer Science, vol 7368. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31362-2_38
Download citation
DOI: https://doi.org/10.1007/978-3-642-31362-2_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31361-5
Online ISBN: 978-3-642-31362-2
eBook Packages: Computer ScienceComputer Science (R0)