Document Representations for Classification of Short Web-Page Descriptions

Radovanović, Miloš; Ivanović, Mirjana

doi:10.1007/11823728_52

Miloš Radovanović¹⁸ &
Mirjana Ivanović¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 4081))

Included in the following conference series:

International Conference on Data Warehousing and Knowledge Discovery

792 Accesses

Abstract

Motivated by applying Text Categorization to sorting Web search results, this paper describes an extensive experimental study of the impact of bag-of-words document representations on the performance of five major classifiers – Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts represent short Web-page descriptions from the dmoz Open Directory Web-page ontology. Different transformations of input data: stemming, normalization, logtf and idf, together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance measured by classical metrics – accuracy, precision, recall, F₁ and F₂. The emphasis of the study is not on determining the best document representation which corresponds to each classifier, but rather on describing the effects of every individual transformation on classification, together with their mutual relationships.

This work was supported by project Abstract Methods and Applications in Computer Science (no. 144017A), of the Serbian Ministry of Science and Environmental Protection.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Improved Document Categorization Through Feature-Rich Combinations

A Probabilistic Vector Representation and Neural Network for Text Classification

Automatic Text Classification Using Neural Network and Statistical Approaches

References

Sebastiani, F.: Text categorization. In: Zanasi, A. (ed.) Text Mining and its Applications, WIT Press, Southampton (2005)
Google Scholar
Radovanović, M., Ivanović, M.: CatS: A classification-powered meta-search engine. In: Advances in Web Intelligence and Data Mining. Studies in Computational Intelligence, vol. 23, Springer, Heidelberg (2006)
Google Scholar
Mladenić, D.: Text-learning and related intelligent agents. IEEE Intelligent Systems, Special Issue on Applications of Intelligent Information Retrieval 14(4), 44–54 (1999)
Google Scholar
Gabrilovich, E., Markovitch, S.: Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In: Proceedings of ICML04, 21st International Conference on Machine Learning, Baniff, Canada (2004)
Google Scholar
Leopold, E., Kindermann, J.: Text categorization with Support Vector Machines. How to represent texts in input space? Machine Learning 46, 423–444 (2002)
Article MATH Google Scholar
Stricker, M., Vichot, F., Dreyfus, G., Wolinski, F.: Vers la conception automatique de filtres d’informations efficaces. In: Proceedings of RFIA 2000, Reconnaissance des Formes et Intelligence Artificielle, pp. 129–137 (2000)
Google Scholar
Wu, X., Srihari, R., Zheng, Z.: Document representation for one-class SVM. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, Springer, Heidelberg (2004)
Google Scholar
Kibriya, A.M., Frank, E., Pfahringer, B., Holmes, G.: Multinomial naive bayes for text categorization revisited. In: Webb, G.I., Yu, X. (eds.) AI 2004. LNCS (LNAI), vol. 3339, pp. 488–499. Springer, Heidelberg (2004)
Chapter Google Scholar
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
MATH Google Scholar
Rennie, J.D.M., Shih, L., Teevan, J., Karger, D.R.: Tackling the poor assumptions of naive Bayes text classifiers. In: Proceedings of ICML 2003, 20th International Conference on Machine Learning (2003)
Google Scholar
Platt, J.: Fast training of Support Vector Machines using Sequential Minimal Optimization. In: Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge (1999)
Google Scholar
Freund, Y., Schapire, R.E.: Large margin classification using the perceptron algorithm. Machine Learning 37(3), 277–296 (1999)
Article MATH Google Scholar
Aha, D., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6(1), 37–66 (1991)
Google Scholar
Quinlan, R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993)
Google Scholar

Download references

Author information

Authors and Affiliations

Faculty of Science, Department of Mathematics and Informatics, University of Novi Sad, Trg . Obradovića 4, 21000, Novi Sad, Serbia and Montenegro
Miloš Radovanović & Mirjana Ivanović

Authors

Miloš Radovanović
View author publications
You can also search for this author in PubMed Google Scholar
Mirjana Ivanović
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040, Wien, Austria
A Min Tjoa
Department of Software and Computing Systems, University of Alicante, Spain
Juan Trujillo

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Radovanović, M., Ivanović, M. (2006). Document Representations for Classification of Short Web-Page Descriptions. In: Tjoa, A.M., Trujillo, J. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2006. Lecture Notes in Computer Science, vol 4081. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11823728_52

Download citation

DOI: https://doi.org/10.1007/11823728_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37736-8
Online ISBN: 978-3-540-37737-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Document Representations for Classification of Short Web-Page Descriptions

Abstract

Access this chapter

Preview

Similar content being viewed by others

Improved Document Categorization Through Feature-Rich Combinations

A Probabilistic Vector Representation and Neural Network for Text Classification

Automatic Text Classification Using Neural Network and Statistical Approaches

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Document Representations for Classification of Short Web-Page Descriptions

Abstract

Access this chapter

Preview

Similar content being viewed by others

Improved Document Categorization Through Feature-Rich Combinations

A Probabilistic Vector Representation and Neural Network for Text Classification

Automatic Text Classification Using Neural Network and Statistical Approaches

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation