Exploratory Class-Imbalanced and Non-identical Data Distribution in Automatic Keyphrase Extraction

Ni, Weijian; Liu, Tong; Zeng, Qingtian

doi:10.1007/978-3-642-31362-2_38

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7368)

Included in the following conference series:

International Symposium on Neural Networks

Abstract

While supervised learning algorithms hold much promise for automatic keyphrase extraction, most of them assume that samples are evenly distributed among classes and drawn from an identical distribution. Neither assumption necessarily holds in the real-world task of extracting keyphrases from documents. In this paper, we propose a novel supervised keyphrase extraction approach that addresses both class-imbalanced and non-identical data distributions. Our approach is by nature a stacking approach: meta-models are trained on balanced partitions of a given training set and then combined by introducing meta-features that describe the particular keyphrase patterns embedded in each document. Experimental results verify the effectiveness of our approach.
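
The stacking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not specify the base learner, the meta-features, or how the combination is fitted. Everything named here is an assumption made for illustration: the nearest-centroid base learner, the helper names (`balanced_partitions`, `train_centroid_model`, `combine`), and the fixed coefficients `coeffs`, which a real system would learn from held-out data.

```python
import random


def balanced_partitions(positives, negatives, seed=0):
    """Split the majority class (non-keyphrases) into chunks about the size
    of the minority class and pair each chunk with all minority samples."""
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    size = max(1, len(positives))
    chunks = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]
    return [(positives, chunk) for chunk in chunks]


def centroid(rows):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_centroid_model(positives, negatives):
    """Hypothetical base learner: a nearest-centroid scorer."""
    pos_c, neg_c = centroid(positives), centroid(negatives)

    def score(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos_c))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg_c))
        return d_neg - d_pos  # positive when x is closer to the keyphrase centroid

    return score


def combine(models, x, doc_meta, coeffs):
    """Combine model scores; the weight of model k is a linear function of
    the document's meta-features (coeffs[k] is assumed, not learned here)."""
    total = 0.0
    for model, row in zip(models, coeffs):
        weight = sum(c * m for c, m in zip(row, doc_meta))
        total += weight * model(x)
    return total


# Toy data: two keyphrase candidates versus five non-keyphrase candidates.
positives = [[0.9, 0.8], [0.8, 0.9]]
negatives = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.2], [0.1, 0.3], [0.2, 0.3]]

parts = balanced_partitions(positives, negatives)
models = [train_centroid_model(p, n) for p, n in parts]

doc_meta = [1.0, 0.5]                    # hypothetical per-document meta-features
coeffs = [[0.5, 0.2] for _ in models]    # hypothetical combination coefficients
candidate_score = combine(models, [0.85, 0.85], doc_meta, coeffs)
```

Each base model sees roughly as many negatives as positives, so no single model is dominated by the majority class. Making the weights depend on `doc_meta` lets the combination vary from document to document, which is one plausible reading of how meta-features could address non-identical distributions.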

Author information

Authors and Affiliations

Shandong University of Science and Technology, Qingdao, Shandong Province, 266510, P.R. China
Weijian Ni, Tong Liu & Qingtian Zeng

Editor information

Editors and Affiliations

Department of Mechanical & Automation Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
Jun Wang
School of Electrical and Computer Engineering, Oklahoma State University, 74078, Stillwater, OK, USA
Gary G. Yen
Department of Electrical and Computer Engineering, University of Cyprus, 75 Kallipoleos Avenue, 1678, Nicosia, Cyprus
Marios M. Polycarpou

About this paper

Cite this paper

Ni, W., Liu, T., Zeng, Q. (2012). Exploratory Class-Imbalanced and Non-identical Data Distribution in Automatic Keyphrase Extraction. In: Wang, J., Yen, G.G., Polycarpou, M.M. (eds) Advances in Neural Networks – ISNN 2012. ISNN 2012. Lecture Notes in Computer Science, vol 7368. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31362-2_38

DOI: https://doi.org/10.1007/978-3-642-31362-2_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31361-5
Online ISBN: 978-3-642-31362-2
eBook Packages: Computer Science, Computer Science (R0)
