
A Probabilistic Model Based on Uncertainty for Data Clustering

  • Conference paper
Agents and Data Mining Interaction (ADMI 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7607)

Abstract

Recently, real-life data of all kinds have grown explosively. To manage these data, the dataspace has become a universal platform that holds unstructured, semi-structured and structured data alike. However, clustering the data in a dataspace efficiently and accurately, so that users can manage and explore them, remains a hard problem. Previous work does not sufficiently consider the uncertain relationship between term and topic. Among the many techniques available for handling this problem, probability theory provides an effective way to deal with the uncertainty of clustering. We therefore propose a novel probability model based on topic terms, the Probabilistic Term Similarity Model (PTSM), to tackle the uncertainty between term and topic. The model considers not only terms from the various kinds of data but also the structure information of semi-structured and structured data. Each term is assigned a probability indicating how relevant it is to the topic. From these per-term probabilities, a probabilistic matrix is then established for clustering the various data. Finally, extensive experimental results show that the clustering method based on this probabilistic model performs well and outperforms several classical algorithms.
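
The abstract does not give PTSM's formulas, so the following is only a minimal sketch of the general idea it describes: per-term topic probabilities collected into a probabilistic matrix that drives clustering. Everything here is an assumption made for illustration, not the authors' method: the function names, the toy data, estimating P(topic | term) from a few topic-labeled seed items, averaging term probabilities per item, and assigning each item to its most probable topic. Structure information is imitated by treating an attribute name as just another term (the hypothetical `attr:title`).

```python
from collections import Counter, defaultdict


def term_topic_probabilities(seed_items):
    """Estimate P(topic | term) from topic-labeled seed items (assumed setup)."""
    counts = defaultdict(Counter)  # term -> how often it appears under each topic
    for topic, terms in seed_items:
        for term in terms:
            counts[term][topic] += 1
    probs = {}
    for term, topic_counts in counts.items():
        total = sum(topic_counts.values())
        probs[term] = {topic: n / total for topic, n in topic_counts.items()}
    return probs


def probabilistic_matrix(items, probs, topics):
    """Build one row per item: the mean P(topic | term) over its known terms."""
    matrix = []
    for terms in items:
        known = [t for t in terms if t in probs]
        row = []
        for topic in topics:
            if known:
                row.append(sum(probs[t].get(topic, 0.0) for t in known) / len(known))
            else:
                row.append(0.0)  # no known terms: no evidence for any topic
        matrix.append(row)
    return matrix


def assign_clusters(matrix, topics):
    """Assign each item to its highest-probability topic (ties go to the first)."""
    return [topics[max(range(len(topics)), key=row.__getitem__)] for row in matrix]


# Toy data (hypothetical). "index" appears under both topics, so it is uncertain.
seed_items = [
    ("db", ["query", "index", "table", "attr:title"]),
    ("ml", ["cluster", "model", "index"]),
]
topics = ["db", "ml"]
items = [["query", "table", "index"], ["model", "cluster"]]

probs = term_topic_probabilities(seed_items)
matrix = probabilistic_matrix(items, probs, topics)
clusters = assign_clusters(matrix, topics)
print(clusters)  # ['db', 'ml']
```

In this toy run the ambiguous term `index` contributes probability 0.5 to each topic, so the first item is still assigned to `db` on the strength of its unambiguous terms. The paper's actual probability estimates and clustering procedure may differ substantially from this sketch.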

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yu, Y., Zhu, X., Li, M., Wang, G., Luo, D. (2013). A Probabilistic Model Based on Uncertainty for Data Clustering. In: Cao, L., Zeng, Y., Symeonidis, A.L., Gorodetsky, V.I., Yu, P.S., Singh, M.P. (eds) Agents and Data Mining Interaction. ADMI 2012. Lecture Notes in Computer Science, vol 7607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36288-0_12

  • DOI: https://doi.org/10.1007/978-3-642-36288-0_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36287-3

  • Online ISBN: 978-3-642-36288-0

  • eBook Packages: Computer Science, Computer Science (R0)
