Abstract
This paper describes an efficient method for building reliable machine learning applications in peer-to-peer (P2P) networks using ensemble-based meta methods. We consider this problem in the context of distributed Web exploration applications such as focused crawling. Typical applications are the user-specific classification of retrieved Web content into personalized topic hierarchies and the automatic refinement of such taxonomies using unsupervised machine learning methods (e.g., clustering). Our approach combines models from multiple peers into an advanced decision model that takes the generalization performance of the individual ‘local’ peer models into account. In addition, the meta algorithms can be applied in a restrictive manner, i.e., by leaving out some ‘uncertain’ documents. The results of our systematic evaluation show the viability of the proposed approach.
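As an illustration only, the idea of combining peer models in a restrictive manner can be sketched as a weighted vote with abstention. This is a minimal sketch under stated assumptions, not the construction used in the paper: the function name `restrictive_meta_classify`, the use of an estimated accuracy per peer as its voting weight, and the `threshold` parameter are all hypothetical choices made for this example.

```python
from typing import Callable, Optional, Sequence

# Assumption: each peer model maps a document to a signed score,
# where a positive value means "belongs to the topic".
PeerModel = Callable[[str], float]


def restrictive_meta_classify(
    doc: str,
    models: Sequence[PeerModel],
    weights: Sequence[float],
    threshold: float,
) -> Optional[int]:
    """Weighted vote over peer models.

    Returns +1 (accept), -1 (reject), or None when the normalized
    vote is too close to zero, i.e. the document is left out as
    'uncertain'. The weights stand in for each peer's estimated
    generalization performance (a hypothetical choice).
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Each peer casts a +1 or -1 vote, scaled by its weight.
    vote = sum(
        w * (1 if model(doc) > 0 else -1)
        for model, w in zip(models, weights)
    )
    score = vote / total  # normalized to the range [-1, 1]

    if abs(score) < threshold:
        return None  # abstain on uncertain documents
    return 1 if score > 0 else -1


# Toy peer models based on keyword matching (purely illustrative).
peers = [
    lambda d: 1.0 if "soccer" in d else -1.0,
    lambda d: 1.0 if "goal" in d else -1.0,
    lambda d: 1.0 if "match" in d else -1.0,
]
# Hypothetical accuracy estimates, one per peer.
peer_weights = [0.9, 0.6, 0.5]
```

With `threshold=0.0` every document receives a label. Raising the threshold trades coverage for reliability, because documents on which the peers disagree are skipped instead of being labeled.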
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Siersdorfer, S., Sizov, S. (2006). Automatic Document Organization in a P2P Environment. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_24
DOI: https://doi.org/10.1007/11735106_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)