A Genetic Programming Approach for Combining Structural and Citation-Based Evidence for Text Classification in Web Digital Libraries

  • Chapter
Soft Computing in Web Information Retrieval

Summary

This paper investigates how citation-based information and structural content (e.g., title, abstract) can be combined to improve the classification of text documents into predefined categories. We evaluate eight similarity measures, five derived from the citation structure of the collection and three derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our experiments, using documents from the ACM digital library and the ACM classification scheme, show that GP can discover similarity functions that outperform any single type of evidence in isolation, and that combining these functions through simple majority voting yields performance comparable to that of Support Vector Machine classifiers.
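The fusion idea in the summary can be illustrated with a small sketch. It is not the chapter's implementation. It assumes that a GP individual is an expression tree over per-pair similarity scores and that several classifiers' labels are merged by plain majority vote. The operator set, the evidence names (`title_sim`, `abstract_sim`, `citation_sim`), and the helper names `evaluate` and `majority_vote` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
import operator

# Hypothetical function set for the expression trees; the operators
# actually used in the chapter are not stated in the summary.
OPS = {"+": operator.add, "*": operator.mul, "max": max}


def evaluate(tree, evidence):
    """Evaluate a similarity expression tree for one document pair.

    A tree is a string naming an evidence measure, a numeric constant,
    or a tuple of the form (op, left, right).
    """
    if isinstance(tree, str):
        return evidence[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return OPS[op](evaluate(left, evidence), evaluate(right, evidence))


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


# Illustrative scores for one pair of documents (made-up values).
evidence = {"title_sim": 0.4, "abstract_sim": 0.5, "citation_sim": 0.2}

# One candidate fused similarity function, as GP might evolve it.
tree = ("+", ("*", "title_sim", "abstract_sim"), ("max", "citation_sim", 0.1))
score = evaluate(tree, evidence)

# Labels proposed by three hypothetical per-function classifiers.
winner = majority_vote(["H.3", "I.2", "H.3"])
```

In a full GP run, many such trees would be generated, scored by a fitness function on training data, and recombined over generations. The sketch covers only how one tree is evaluated and how votes are merged.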

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Zhang, B. et al. (2006). A Genetic Programming Approach for Combining Structural and Citation-Based Evidence for Text Classification in Web Digital Libraries. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_4

  • DOI: https://doi.org/10.1007/3-540-31590-X_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31588-9

  • Online ISBN: 978-3-540-31590-2

  • eBook Packages: Engineering, Engineering (R0)
