
A Dynamic Hierarchical Fuzzy Clustering Algorithm for Information Filtering

Chapter in Soft Computing in Web Information Retrieval

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 197)

Summary

In this contribution we propose a hierarchical fuzzy clustering algorithm for dynamically supporting information filtering. The idea is that document filtering can benefit from a dynamic hierarchical fuzzy clustering of the documents into overlapping topic categories that correspond to different levels of granularity of the categorisation. Users can have either general or specific interests, depending on their profile, and so they must be fed with documents belonging to the categories of interest, which can correspond to a high-level topic, such as sport news, a subtopic, such as football news, or even a very specific topic, such as football matches of their favourite team. The hierarchical structure of the automatically identified clusters is built so that each level corresponds to a distinct degree of overlap among its clusters; this degree increases in climbing the hierarchy, since the topics represented in the upper levels are more general, i.e., fuzzier. The hierarchy of fuzzy clusters is used to support filtering criteria that are personalized based on user profiles. Since a filter monitors one or more continuously fed document streams, the clustering must be able both to generate a fuzzy hierarchical classification of a collection of documents and to update the hierarchy of existing categories, either by including newly found documents or by detecting new categories when such new documents have contents that differ from those represented by the existing clusters. The fuzzy clustering algorithm is based on a generalization of the fuzzy C-means algorithm that is iteratively applied to each hierarchical level to identify clusters of the higher level. In order to apply this algorithm to document filtering, it has been extended to use cosine similarity instead of the usual Euclidean distance, and to automatically estimate the number of clusters to detect at each hierarchical level. This number is identified based either on an explicit input that specifies the minimum percentage of common index terms that the clusters of the level can share (which is equivalent to indicating a tolerance for overlap between the topics dealt with in each fuzzy cluster) or on a statistical analysis of the cumulative curve of overlapping degrees between all pairs of clusters of the level. In this way, the limitation of the fuzzy C-means algorithm, which requires the desired number of clusters to be specified, is overcome.
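To make the cosine-based variant concrete, the following is a minimal sketch of a single-level fuzzy C-means in which the squared Euclidean distance is replaced by the dissimilarity 1 - cosine similarity. It is not the chapter's generalized algorithm: the function name `cosine_fuzzy_c_means`, the fuzzifier default `m=2.0`, the centroid re-normalisation, and the toy term-weight matrix are all assumptions made for illustration, and the number of clusters `c` is passed in by hand rather than estimated from cluster overlap as the summary describes.

```python
import numpy as np

def cosine_fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-means using 1 - cosine similarity as the dissimilarity.

    X: (n_docs, n_terms) non-negative term-weight matrix, no all-zero rows.
    c: number of clusters (fixed here by assumption, not estimated).
    m: fuzzifier > 1; larger values give softer memberships.
    Returns (U, V): memberships (n_docs, c) and unit-length centroids (c, n_terms).
    """
    rng = np.random.default_rng(seed)
    # Normalise rows so that a dot product equals cosine similarity.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Random initial memberships, each row summing to 1.
    U = rng.random((Xn.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centroids: membership-weighted sums of documents, re-normalised.
        V = W.T @ Xn
        V /= np.linalg.norm(V, axis=1, keepdims=True)
        # Dissimilarity to each centroid, clipped away from zero.
        D = np.clip(1.0 - Xn @ V.T, 1e-12, None)
        # Standard FCM membership update with D in place of squared distance.
        inv = D ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V

# Hypothetical toy collection: two documents per topic plus one mixed document.
X = np.array([[3, 1, 0, 0],
              [2, 1, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 3],
              [1, 1, 1, 1]], dtype=float)
U, V = cosine_fuzzy_c_means(X, c=2)
```

Each row of `U` holds one document's membership degrees across the clusters, so a document that touches several topics (the last row of the toy matrix) can belong to more than one cluster with non-negligible degree. Applying the same routine to the centroids `V` of one level would be one way to obtain candidate clusters for the level above, although the chapter's actual procedure may differ.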

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Bordogna, G., Pagani, M., Pasi, G. (2006). A Dynamic Hierarchical Fuzzy Clustering Algorithm for Information Filtering. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_1

  • DOI: https://doi.org/10.1007/3-540-31590-X_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-31588-9

  • Online ISBN: 978-3-540-31590-2

  • eBook Packages: Engineering (R0)
