Progressive Filtering on the Web: The Press Reviews Case Study

  • Chapter
Learning Structure and Schemas from Documents

Part of the book series: Studies in Computational Intelligence (SCI, volume 375)

Abstract

Progressive Filtering is a hierarchical classification technique framed within the local-classifier-per-node approach, in which each classifier decides whether the input at hand should be forwarded to its children. In this chapter, we illustrate the effectiveness of Progressive Filtering on the Web, focusing on the task of automatically creating press reviews. To this end, we present NEWS.MAS, a multiagent system aimed at: (i) extracting information from online newspapers by means of wrapper agents, each associated with a specific information source; (ii) categorizing news articles according to a given taxonomy; and (iii) providing user feedback to improve the performance of the system depending on user needs and preferences.
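
The forwarding rule described in the abstract can be illustrated with a minimal sketch. It assumes a tree-shaped taxonomy in which every node carries its own binary classifier. The `Node` class, the `keyword_classifier` helper, and the toy taxonomy are hypothetical names introduced only for illustration; they are not components of NEWS.MAS, and the keyword test merely stands in for whatever trained classifiers the chapter actually uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """A taxonomy category with its own local binary classifier (hypothetical)."""
    name: str
    accepts: Callable[[str], bool]
    children: List["Node"] = field(default_factory=list)


def keyword_classifier(keywords: List[str], threshold: int = 1) -> Callable[[str], bool]:
    """Toy stand-in for a trained classifier: accept a text when at least
    `threshold` of the given keywords occur among its words."""
    def accepts(text: str) -> bool:
        words = set(text.lower().split())
        return sum(kw in words for kw in keywords) >= threshold
    return accepts


def progressive_filter(node: Node, text: str,
                       path: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """Return every taxonomy path along which all classifiers accepted the text."""
    if not node.accepts(text):
        return []  # rejected here, so the input is not forwarded to the children
    path = path + (node.name,)
    accepted = [path]
    for child in node.children:
        accepted.extend(progressive_filter(child, text, path))
    return accepted


# Assumed toy taxonomy; the root accepts every input.
taxonomy = Node("news", lambda text: True, [
    Node("sport", keyword_classifier(["match", "team", "league"]), [
        Node("football", keyword_classifier(["goal", "striker"])),
        Node("tennis", keyword_classifier(["serve", "racket"])),
    ]),
    Node("economy", keyword_classifier(["market", "inflation"])),
])

article = "The team won the match with a late goal"
paths = progressive_filter(taxonomy, article)
for p in paths:
    print(" > ".join(p))
```

In this sketch an article reaches a category only if every ancestor on the path accepted it, so a rejection near the root prunes the entire subtree below it without evaluating any of its classifiers.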

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Addis, A., Armano, G., Vargiu, E. (2011). Progressive Filtering on the Web: The Press Reviews Case Study. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_7

  • DOI: https://doi.org/10.1007/978-3-642-22913-8_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22912-1

  • Online ISBN: 978-3-642-22913-8

  • eBook Packages: Engineering, Engineering (R0)