Effectiveness of Document Representation for Classification

Chen, Ding-Yi; Li, Xue; Dong, Zhao Yang; Chen, Xia

doi:10.1007/11546849_36

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3589)

Abstract

Conventionally, research on document classification focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observation, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features that are used in a representation, the more comprehensively the documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring their effectiveness. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By examining documents in this way, we can clearly identify the contributions made by different features to document classification. Experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
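The abstract does not define the measure itself, so the following is only a minimal sketch of what a classifier-independent check on a representation could look like. It assumes a binary bag-of-words representation restricted to a chosen feature set, and it scores that representation by its conflict rate: the fraction of labelled documents whose feature vector is identical to that of a document with a different label. The function names, the toy corpus and the conflict-rate score are all illustrative assumptions, not the measure used in the paper.

```python
from collections import defaultdict


def represent(tokens, features):
    """Project a tokenised document onto a feature set (binary bag of words)."""
    return frozenset(t for t in tokens if t in features)


def conflict_rate(corpus, features):
    """Fraction of documents whose representation is shared with at least
    one document carrying a different label. Illustrative score only."""
    labels_by_repr = defaultdict(set)
    for tokens, label in corpus:
        labels_by_repr[represent(tokens, features)].add(label)
    conflicted = sum(
        1
        for tokens, _label in corpus
        if len(labels_by_repr[represent(tokens, features)]) > 1
    )
    return conflicted / len(corpus) if corpus else 0.0


# Hypothetical toy corpus of (tokens, label) pairs.
corpus = [
    (["wheat", "price", "rise"], "grain"),
    (["wheat", "export", "fall"], "grain"),
    (["oil", "price", "rise"], "crude"),
    (["oil", "export", "fall"], "crude"),
]

# Features that do not separate the classes collapse differently labelled
# documents onto the same point in the feature space.
rate_price = conflict_rate(corpus, {"price", "export"})

# Features that track the class labels leave no such collisions.
rate_topic = conflict_rate(corpus, {"wheat", "oil"})
```

No classifier is trained at any point: the score depends only on the labelled corpus and the chosen features, which is the sense in which such a check is classifier-independent. A lower conflict rate suggests that a feature set keeps differently labelled documents apart in the feature space.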

Author information

Authors and Affiliations

School of Information Technology and Electrical Engineering, University of Queensland, QLD, 4072, Australia
Ding-Yi Chen, Xue Li, Zhao Yang Dong & Xia Chen

Editor information

Editors and Affiliations

Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstr. 9-11/188, A-1040, Wien, Austria
A Min Tjoa
Department of Software and Computing Systems, University of Alicante, Spain
Juan Trujillo

About this paper

Cite this paper

Chen, DY., Li, X., Dong, Z.Y., Chen, X. (2005). Effectiveness of Document Representation for Classification. In: Tjoa, A.M., Trujillo, J. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2005. Lecture Notes in Computer Science, vol 3589. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546849_36

DOI: https://doi.org/10.1007/11546849_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28558-8
Online ISBN: 978-3-540-31732-6
eBook Packages: Computer Science (R0)
