Abstract
Guaranteeing the quality of extracted features that describe knowledge relevant to users or topics is challenging because of the large number of features extracted. Most popular term-based feature selection methods extract noisy features, that is, features irrelevant to the user's needs. One popular approach is to extract phrases or n-grams to describe the relevant knowledge. However, extracted n-grams and phrases usually contain a lot of noise. This paper proposes a method for reducing the noise in n-grams. The method first extracts more specific features (terms) in order to remove noisy features. It then uses an extended random set to weight n-grams accurately, based on their distribution in the documents and the distribution of their terms in the n-grams. The proposed approach not only reduces the number of extracted n-grams but also improves performance. Experimental results on the Reuters Corpus Volume 1 (RCV1) data collection and TREC topics show that the proposed method significantly outperforms state-of-the-art methods underpinned by Okapi BM25, tf*idf and Rocchio.
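The two-stage idea described in the abstract (select specific terms first, then weight the surviving n-grams by how they and their terms are distributed) can be illustrated with a rough sketch. This is not the paper's algorithm: the abstract does not give the extended random set formulas, so the weighting below is a simplified stand-in chosen for illustration, and the names `extract_ngrams`, `weight_ngrams` and `specific_terms` are hypothetical. The set of specific terms is assumed to be supplied by an earlier term-selection step.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weight_ngrams(docs, specific_terms, n=2):
    """Illustrative n-gram weighting (an assumption, not the paper's formula).

    docs           -- list of token lists (the relevant documents)
    specific_terms -- set of terms kept by a prior term-selection step
    """
    # Stage 1: keep only n-grams built entirely from selected terms,
    # and count in how many documents each surviving n-gram occurs.
    doc_freq = Counter()
    for tokens in docs:
        for gram in set(extract_ngrams(tokens, n)):
            if all(term in specific_terms for term in gram):
                doc_freq[gram] += 1

    # Stage 2a: spread each n-gram's document frequency evenly over its
    # terms, giving a distribution of terms across the n-grams.
    term_mass = Counter()
    for gram, df in doc_freq.items():
        for term in gram:
            term_mass[term] += df / len(gram)

    # Stage 2b: score each n-gram by the normalised mass of its terms.
    total = sum(term_mass.values())
    return {
        gram: sum(term_mass[term] for term in gram) / total
        for gram in doc_freq
    }
```

In this sketch an n-gram containing any unselected term is discarded outright, which is one simple way to let term selection reduce the number of n-grams; the paper may filter less aggressively.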
Copyright information
© 2013 Springer International Publishing Switzerland
About this paper
Cite this paper
Albathan, M., Li, Y., Algarni, A. (2013). Enhanced N-Gram Extraction Using Relevance Feature Discovery. In: Cranefield, S., Nayak, A. (eds) AI 2013: Advances in Artificial Intelligence. AI 2013. Lecture Notes in Computer Science, vol 8272. Springer, Cham. https://doi.org/10.1007/978-3-319-03680-9_46
DOI: https://doi.org/10.1007/978-3-319-03680-9_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-03679-3
Online ISBN: 978-3-319-03680-9
eBook Packages: Computer Science (R0)