A transduction-based approach to fuzzy clustering, relevance ranking and cluster label generation on web search results

Journal of Intelligent Information Systems

Abstract

This paper details a modular, self-contained web search results clustering system that enhances search results by (i) clustering the lists of web documents returned by search engine queries, and (ii) ranking the results and labeling the resulting clusters, using a calculated relevance value as each document's degree of membership in a cluster. In addition, we demonstrate an external evaluation method based on precision for comparing fuzzy clustering techniques, as well as internal measures suitable for non-training data. The built-in label generator uses the membership degrees and relevance values to weight the most relevant results more heavily. The membership degrees of documents to fuzzy clusters also facilitate effective detection and removal of overly similar clusters. To achieve this, our transduction-based clustering algorithm (TCA) and its fuzzy counterpart (FTCA) employ a transduction-based relevance model (TRM) to consider local relationships between web documents. Tests on five real-world and synthetic datasets show favorable results compared with the established label-based clustering algorithms Suffix Tree Clustering (STC) and Lingo.
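
To make two of the ideas above concrete, the following is a minimal Python sketch of how membership degrees could be used to drop overly similar fuzzy clusters and to weight candidate label terms. It is not the paper's TCA, FTCA, or TRM, whose details the abstract does not give. The fuzzy Jaccard similarity, the 0.8 threshold, the function names, and the toy data are all assumptions chosen for illustration.

```python
from collections import defaultdict


def fuzzy_jaccard(a, b):
    """Similarity of two fuzzy clusters, each given as {doc_id: membership in [0, 1]}.

    Assumed measure: sum of element-wise minima over sum of element-wise maxima.
    """
    docs = set(a) | set(b)
    inter = sum(min(a.get(d, 0.0), b.get(d, 0.0)) for d in docs)
    union = sum(max(a.get(d, 0.0), b.get(d, 0.0)) for d in docs)
    return inter / union if union > 0 else 0.0


def drop_similar_clusters(clusters, threshold=0.8):
    """Keep a cluster only if it is not too similar to one already kept.

    Clusters are visited in descending order of total membership, so the
    heavier of two near-duplicates survives. The threshold is hypothetical.
    """
    kept = []
    for c in sorted(clusters, key=lambda c: sum(c.values()), reverse=True):
        if all(fuzzy_jaccard(c, k) < threshold for k in kept):
            kept.append(c)
    return kept


def label_scores(cluster, doc_terms):
    """Score candidate label terms by summing the memberships of the
    documents that contain them, so highly relevant documents count more."""
    scores = defaultdict(float)
    for doc, mu in cluster.items():
        for term in set(doc_terms.get(doc, ())):
            scores[term] += mu
    return dict(scores)


# Toy data (invented): three fuzzy clusters over four search results.
clusters = [
    {"d1": 0.9, "d2": 0.8, "d3": 0.1},
    {"d1": 0.8, "d2": 0.7, "d3": 0.1},  # near-duplicate of the first cluster
    {"d3": 0.9, "d4": 0.8},
]
doc_terms = {
    "d1": ["jaguar", "car"],
    "d2": ["jaguar", "dealer"],
    "d3": ["jaguar", "cat"],
    "d4": ["big", "cat"],
}

kept = drop_similar_clusters(clusters)
scores = label_scores(kept[0], doc_terms)
best_label = max(scores, key=scores.get)
```

In this sketch the second cluster is discarded because nearly all of its membership mass overlaps with the first, and the label term shared by the most strongly associated documents scores highest.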

Author information

Correspondence to Takazumi Matsumoto.

Additional information

The work described in this paper was partially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. PolyU 5191/09E, PolyU 5182/08E, PolyU 5174/07E, PolyU 5181/06E, G-U524).

About this article

Cite this article

Matsumoto, T., Hung, E. A transduction-based approach to fuzzy clustering, relevance ranking and cluster label generation on web search results. J Intell Inf Syst 38, 419–448 (2012). https://doi.org/10.1007/s10844-011-0161-8

  • DOI: https://doi.org/10.1007/s10844-011-0161-8
