Document Representations for Classification of Short Web-Page Descriptions

  • Conference paper
Data Warehousing and Knowledge Discovery (DaWaK 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4081)

Abstract

Motivated by the application of text categorization to sorting Web search results, this paper describes an extensive experimental study of how bag-of-words document representations affect the performance of five major classifiers: Naïve Bayes, SVM, Voted Perceptron, kNN and C4.5. The texts are short Web-page descriptions from the dmoz Open Directory Web-page ontology. Several transformations of the input data (stemming, normalization, logtf and idf), together with dimensionality reduction, are found to have a statistically significant improving or degrading effect on classification performance, as measured by the classical metrics accuracy, precision, recall, F1 and F2. The emphasis of the study is not on determining the best document representation for each classifier, but on describing the effect of each individual transformation on classification, together with the relationships among the transformations.
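To make the terminology concrete, the following is a minimal sketch of three of the transformations named in the abstract (logtf, idf and normalization) applied to a bag-of-words representation, plus the F-beta measure that yields F1 and F2. The formulas are common textbook choices and are assumptions here: log(1 + tf) for logtf, log(N / df) for idf, and L2 length normalization. The abstract does not state the exact definitions used in the paper. The function names `transform` and `f_beta` are hypothetical, and stemming and dimensionality reduction are left out.

```python
import math
from collections import Counter


def transform(docs, logtf=False, idf=False, normalize=False):
    """Turn tokenized documents into bag-of-words vectors.

    Each flag switches one transformation on or off, so that its
    individual effect can be studied. Formulas are assumed, not
    taken from the paper.
    """
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({term for c in counts for term in c})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocab}

    vectors = []
    for c in counts:
        vec = []
        for t in vocab:
            w = float(c.get(t, 0))           # raw term frequency
            if logtf:
                w = math.log(1.0 + w)        # dampen repeated terms
            if idf:
                w *= math.log(n_docs / df[t])  # down-weight common terms
            vec.append(w)
        if normalize:
            norm = math.sqrt(sum(w * w for w in vec))
            if norm > 0:
                vec = [w / norm for w in vec]  # unit Euclidean length
        vectors.append(vec)
    return vocab, vectors


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from true positives, false positives and false negatives.

    beta=1 gives F1; beta=2 gives F2, which weights recall more heavily.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Calling `transform(docs, logtf=True, idf=True, normalize=True)` on a list of token lists combines all three steps. Keeping each step behind its own flag mirrors the study's focus on the effect of individual transformations rather than on a single best pipeline.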

This work was supported by project Abstract Methods and Applications in Computer Science (no. 144017A), of the Serbian Ministry of Science and Environmental Protection.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Radovanović, M., Ivanović, M. (2006). Document Representations for Classification of Short Web-Page Descriptions. In: Tjoa, A.M., Trujillo, J. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2006. Lecture Notes in Computer Science, vol 4081. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11823728_52

  • DOI: https://doi.org/10.1007/11823728_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37736-8

  • Online ISBN: 978-3-540-37737-5

  • eBook Packages: Computer Science, Computer Science (R0)
