Abstract
This paper proposes an effective scoring scheme for feature selection in text mining that exploits characteristics of the small-world phenomenon in the semantic networks of documents. Our focus is on preserving both the syntactic and the statistical information of words, rather than relying solely on simple frequency summaries as prevailing scoring schemes such as TFIDF do. Experimental results on a TREC dataset show that our scoring scheme outperforms the prevailing schemes.
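As a rough illustration of what "characteristics of the small-world phenomenon" can mean for a document, the sketch below builds a word network and computes the two quantities usually associated with small-world networks: the average clustering coefficient and the average shortest-path length. This is not the scoring scheme of the paper, which the abstract does not specify. The sentence-level co-occurrence rule and the function names `build_cooccurrence_graph`, `clustering_coefficient`, and `average_path_length` are assumptions made only for this example.

```python
from collections import deque
from itertools import combinations


def build_cooccurrence_graph(sentences):
    """Link two words whenever they share a sentence (an assumed rule)."""
    graph = {}
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            graph.setdefault(word, set())
        for a, b in combinations(sorted(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clustering_coefficient(graph):
    """Average local clustering coefficient over all nodes."""
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count edges that exist between pairs of neighbours.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph) if graph else 0.0


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in graph:
        # Breadth-first search gives shortest paths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    # Hypothetical toy document, two sentences.
    g = build_cooccurrence_graph(["small world network", "network of words"])
    print(clustering_coefficient(g), average_path_length(g))
```

A network is usually called small-world when its clustering coefficient is high and its average path length is short compared with a random graph of the same size. How such measurements would be turned into a per-word score is left open here, since the abstract gives no formula.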
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, C., Tian, Y., Huang, T., Gao, W. (2006). Semantic Scoring Based on Small-World Phenomenon for Feature Selection in Text Mining. In: Li, X., Zaïane, O.R., Li, Z. (eds) Advanced Data Mining and Applications. ADMA 2006. Lecture Notes in Computer Science, vol 4093. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11811305_70
DOI: https://doi.org/10.1007/11811305_70
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37025-3
Online ISBN: 978-3-540-37026-0
eBook Packages: Computer Science, Computer Science (R0)