Automatic Document Organization in a P2P Environment

Siersdorfer, Stefan; Sizov, Sergej

doi:10.1007/11735106_24

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3936)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

This paper describes an efficient method for constructing reliable machine learning applications in peer-to-peer (P2P) networks by building ensemble-based meta methods. We consider this problem in the context of distributed Web exploration applications such as focused crawling. Typical applications are the user-specific classification of retrieved Web content into personalized topic hierarchies and the automatic refinement of such taxonomies using unsupervised machine learning methods (e.g., clustering). Our approach is to combine models from multiple peers and to construct an advanced decision model that takes the generalization performance of the multiple ‘local’ peer models into account. In addition, meta algorithms can be applied in a restrictive manner, i.e., by leaving out some ‘uncertain’ documents. The results of our systematic evaluation show the viability of the proposed approach.
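
The abstract does not spell out the combination rule, so the following is only a minimal sketch of one way such a restrictive meta method could look, not the authors' algorithm. It assumes binary peer classifiers that return +1 or -1, weights that stand in for each peer's estimated generalization performance, and an abstention threshold; the names `restrictive_vote`, `MODELS`, `WEIGHTS`, and `threshold` are all hypothetical.

```python
from typing import Callable, Optional, Sequence

# Assumption: a "peer model" is any function mapping a document to +1 or -1.
PeerModel = Callable[[str], int]


def restrictive_vote(
    doc: str,
    models: Sequence[PeerModel],
    weights: Sequence[float],
    threshold: float = 0.0,
) -> Optional[int]:
    """Weighted vote over peer models; returns None for 'uncertain' documents.

    weights are hypothetical stand-ins for each peer's estimated
    generalization performance (for example, a held-out accuracy estimate).
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalized weighted sum of votes, in the range [-1, 1].
    score = sum(w * m(doc) for m, w in zip(models, weights)) / total
    # Restrictive step: abstain when the weighted vote is too close to a tie.
    if abs(score) <= threshold:
        return None
    return 1 if score > 0 else -1


# Toy keyword-based peer models, for illustration only.
MODELS = [
    lambda d: 1 if "football" in d else -1,
    lambda d: 1 if "goal" in d else -1,
    lambda d: 1 if "match" in d else -1,
]
WEIGHTS = [0.9, 0.6, 0.7]

if __name__ == "__main__":
    for text in ("football match goal", "stock market", "goal setting"):
        print(text, "->", restrictive_vote(text, MODELS, WEIGHTS, threshold=0.5))
```

Raising `threshold` makes the combined classifier abstain on more documents, trading coverage for a decision that more of the weighted peers agree on.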

Author information

Authors and Affiliations

Max-Planck Institute for Computer Science, Germany
Stefan Siersdorfer & Sergej Sizov

Editor information

Editors and Affiliations

Queen Mary, University of London, London, UK
Mounia Lalmas
Department of Information Science, City University, Northampton Square, EC1V 0HB, London, UK
Andy MacFarlane
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Queen Mary University of London, UK
Anastasios Tombros
CWI, Amsterdam, The Netherlands
Theodora Tsikrika
Department of Computing, Imperial College London, South Kensington Campus, SW7 2AZ, London, UK
Alexei Yavlinsky

About this paper

Cite this paper

Siersdorfer, S., Sizov, S. (2006). Automatic Document Organization in a P2P Environment. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_24

DOI: https://doi.org/10.1007/11735106_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)
