Abstract
In this paper we develop a local, distributed, privacy-preserving algorithm for feature selection in a large peer-to-peer environment. Feature selection is often used in machine learning for data compaction and efficient learning, because it mitigates the curse of dimensionality. Many solutions exist for feature selection when the data reside at a central location. However, the task becomes extremely challenging when the data are distributed across a large number of peers or machines. Centralizing the entire dataset, or even portions of it, can be costly and impractical because of the large number of data sources, the asynchronous nature of peer-to-peer networks, the dynamic nature of the data and the network, and privacy concerns. The proposed solution performs feature selection asynchronously with low communication overhead, and each peer can specify its own privacy constraints. The algorithm relies only on local interactions among participating nodes. We present results on a real-world dataset to evaluate the performance of the proposed algorithm.
Cite this article
Das, K., Bhaduri, K. & Kargupta, H. A local asynchronous distributed privacy preserving feature selection algorithm for large peer-to-peer networks. Knowl Inf Syst 24, 341–367 (2010). https://doi.org/10.1007/s10115-009-0274-3