A Dynamic Hierarchical Fuzzy Clustering Algorithm for Information Filtering

Bordogna, Gloria; Pagani, Marco; Pasi, Gabriella

doi:10.1007/3-540-31590-X_1

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 197)

Summary

In this contribution we propose a hierarchical fuzzy clustering algorithm that dynamically supports information filtering. The idea is that document filtering can benefit from a dynamic hierarchical fuzzy clustering of the documents into overlapping topic categories, organised at different levels of granularity. Users can have either general or specific interests, depending on their profile, and must therefore be fed documents belonging to their categories of interest. These categories can correspond to a high-level topic, such as sport news, to a subtopic, such as football news, or even to a very specific topic, such as the football matches of their favourite team. The hierarchy of automatically identified clusters is built so that each level corresponds to a distinct degree of overlap among its clusters; this degree increases when climbing the hierarchy, since the topics represented at the upper levels are more general, i.e., fuzzier. The hierarchy of fuzzy clusters supports filtering criteria that are personalised on the basis of user profiles. Since a filter monitors one or more continuously fed document streams, the clustering must be able both to generate a fuzzy hierarchical classification of a collection of documents and to update the hierarchy of existing categories, either by including newly found documents or by detecting new categories when the contents of such documents differ from those represented by the existing clusters. The fuzzy clustering algorithm is based on a generalization of the fuzzy C-means algorithm, applied iteratively to each hierarchical level to identify the clusters of the next higher level. To apply this algorithm to document filtering, it has been extended to use cosine similarity instead of the usual Euclidean distance and to estimate automatically the number of clusters to detect at each hierarchical level. This number is determined either from an explicit input specifying the minimum percentage of common index terms that the clusters of the level can share (equivalent to indicating a tolerance for overlap between the topics dealt with in each fuzzy cluster) or from a statistical analysis of the cumulative curve of overlapping degrees between all pairs of clusters of the level. This overcomes a limitation of fuzzy C-means, which requires the desired number of clusters to be specified in advance.
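To make the cosine-based variant concrete, the following is a minimal sketch of a single-level fuzzy C-means pass that replaces Euclidean distance with the dissimilarity 1 - cos(x, v). It is not the chapter's algorithm: the hierarchical iteration, the incremental update, and the automatic estimation of the number of clusters are not reproduced, and the number of clusters is simply passed in. The function names (`cosine_fcm`, `memberships_for`), the fuzzifier default `m=2.0`, the optional `init` argument, and the centroid update (normalised weighted sum of unit-length documents, a common choice for cosine-based fuzzy clustering) are all assumptions made for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_dissimilarity(a, b):
    """1 - cosine similarity, valid for vectors that are already unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def memberships_for(docs, centroids, m, eps):
    """Standard fuzzy C-means membership update; each row sums to 1."""
    rows = []
    for d in docs:
        # Clamp at eps so a document lying exactly on a centroid does not divide by zero.
        dist = [max(cosine_dissimilarity(d, c), eps) for c in centroids]
        inv = [(1.0 / x) ** (1.0 / (m - 1.0)) for x in dist]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def cosine_fcm(docs, n_clusters, m=2.0, n_iter=50, init=None, seed=0, eps=1e-9):
    """Fuzzy C-means on term-weight vectors with cosine dissimilarity.

    docs: list of equal-length lists of term weights.
    m: fuzzifier, must be > 1 (assumed default 2.0).
    init: optional list of document indices used as initial centroids
          (hypothetical convenience parameter; random documents otherwise).
    Returns (memberships, centroids).
    """
    docs = [normalize(d) for d in docs]
    if init is None:
        init = random.Random(seed).sample(range(len(docs)), n_clusters)
    centroids = [list(docs[i]) for i in init]
    for _ in range(n_iter):
        u = memberships_for(docs, centroids, m, eps)
        for i in range(n_clusters):
            # Centroid = normalised sum of documents weighted by u^m (assumed update rule).
            acc = [0.0] * len(docs[0])
            for d, row in zip(docs, u):
                w = row[i] ** m
                for t, x in enumerate(d):
                    acc[t] += w * x
            centroids[i] = normalize(acc)
    # Recompute memberships so they match the final centroids.
    return memberships_for(docs, centroids, m, eps), centroids
```

As a usage example, calling `cosine_fcm(docs, 2, init=[0, 2])` on four short term-weight vectors, two per topic, returns one membership row per document. Each row sums to 1, so a document can belong partially to several topic clusters, which is the overlap that the summary relies on.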

Author information

Authors and Affiliations

Gruppo di Georisorse, CNR — IDPA, Sez. di Milano, via Pasubio 5, c/o POINT, 24044, Dalmine (BG), Italy
Gloria Bordogna & Marco Pagani
Dip. Di Informatica, Sistemistica e Comunicazione Univ. Degli Studi di Milano, Bicocca p.le Ateneo Nuovo, 1, Milano, Italy
Gabriella Pasi

Editor information

Editors and Affiliations

Department of Computer Science and A.I E.T.S.I. Informatica, University of Granada, C/Periodista Daniel, Saucedo Aranda s/n, Granada, Spain
Enrique Herrera-Viedma
Department of Informatics Systems and Communication (DISCo), Università degli Studi di Milano Bicocca, Via Bicocca degli Arcimboldi, 8 (Edificio U7), 20126, Milano, Italy
Gabriella Pasi
Department of Computer and Information Sciences, University of Strathclyde, Livingstone Tower, 26 Richmond Street, Glasgow, G1 1XH, Scotland, UK
Fabio Crestani

About this chapter

Cite this chapter

Bordogna, G., Pagani, M., Pasi, G. (2006). A Dynamic Hierarchical Fuzzy Clustering Algorithm for Information Filtering. In: Herrera-Viedma, E., Pasi, G., Crestani, F. (eds) Soft Computing in Web Information Retrieval. Studies in Fuzziness and Soft Computing, vol 197. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31590-X_1

DOI: https://doi.org/10.1007/3-540-31590-X_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31588-9
Online ISBN: 978-3-540-31590-2
eBook Packages: Engineering, Engineering (R0)