DOI: 10.1145/952532.952618
Article

A new distributed data mining model based on similarity

Published: 09 March 2003

ABSTRACT

Distributed Data Mining (DDM) has been very active and has enjoyed a growing amount of attention since its inception. Current DDM techniques regard the distributed data sets as a single virtual table and assume that there exists a global model which could be generated if the data were combined or centralized. This paper proposes a similarity-based distributed data mining (SBDDM) framework which explicitly takes the differences among distributed sources into consideration. A new similarity measure is introduced, and its effectiveness is evaluated and validated. This paper also illustrates the limitations of current DDM techniques through three concrete case studies. Finally, distributed clustering within the SBDDM framework is discussed.
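
The abstract does not define the similarity measure, so the following is only an illustrative sketch of the general idea of comparing distributed sources before mining them together. It assumes each source is a list of transactions (baskets of items) and uses a stand-in measure, one minus the mean absolute difference in per-item support. This is not the measure proposed in the paper. The names `item_supports`, `support_similarity`, `group_sources`, the site names, and the threshold value are all hypothetical.

```python
from itertools import combinations


def item_supports(transactions):
    """Return the fraction of transactions that contain each item."""
    n = len(transactions)
    if n == 0:
        return {}
    counts = {}
    for t in transactions:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    return {item: c / n for item, c in counts.items()}


def support_similarity(a, b):
    """Stand-in similarity in [0, 1]: 1 minus the mean absolute
    difference in item support over the union of items.
    (An assumption for illustration, not the paper's measure.)"""
    sa, sb = item_supports(a), item_supports(b)
    items = set(sa) | set(sb)
    if not items:
        return 1.0
    diff = sum(abs(sa.get(i, 0.0) - sb.get(i, 0.0)) for i in items)
    return 1.0 - diff / len(items)


def group_sources(sources, threshold):
    """Group sources whose pairwise similarity reaches the threshold
    (single-link grouping via a small union-find)."""
    names = list(sources)
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for u, v in combinations(names, 2):
        if support_similarity(sources[u], sources[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: two similar sites and one dissimilar site.
sources = {
    "site_a": [["bread", "milk"], ["bread"], ["milk"], ["bread", "milk"]],
    "site_b": [["bread", "milk"], ["bread"], ["milk"], ["milk"]],
    "site_c": [["beer", "chips"], ["beer"], ["chips", "beer"], ["beer"]],
}

groups = group_sources(sources, threshold=0.8)
print(groups)
```

With a grouping like this, a model could be mined per group of similar sources rather than assuming one global model over all of them, which is the contrast the abstract draws with the single-virtual-table view.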

  • Published in

    SAC '03: Proceedings of the 2003 ACM symposium on Applied computing
    March 2003
    1268 pages
    ISBN:1581136242
    DOI:10.1145/952532

    Copyright © 2003 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
