Algorithm for Clustering of Web Search Results from a Hyper-heuristic Approach

  • Conference paper
Advances in Soft Computing (MICAI 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10062)

Abstract

The clustering of web search results, also known as web document clustering (WDC), has become an active research area in the information retrieval community. Systems for the clustering of web search results, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review, while reducing the time spent reviewing them. Several algorithms for clustering web results already exist, but reported results leave room for improvement. This paper introduces a hyper-heuristic framework called WDC-HH, which makes it possible to identify the best-performing algorithm for WDC. The framework uses four high-level heuristics (performance-based rank selection, tabu selection, random selection and performance-based roulette wheel selection) to select among low-level heuristics, which solve the specific problem of WDC. The low-level heuristics considered are: harmony search, improved harmony search, novel global harmony search, global-best harmony search, eighteen genetic algorithm variations, particle swarm optimization, artificial bee colony, and differential evolution. The framework uses the k-means algorithm as a local solution improvement strategy and, based on the Balanced Bayesian Information Criterion, automatically determines the appropriate number of clusters. The framework also uses four acceptance/replacement strategies (replacement heuristics): Replace the Worst, Restricted Competition Replacement, Stochastic Replacement and Rank Replacement. WDC-HH was tested on four data sets comprising a total of 447 queries with their ideal solutions. The main result of the framework assessment is a new algorithm, based on global-best harmony search and the rank replacement strategy, that obtained the best results on the WDC problem. This new algorithm, called WDC-HH-BHRK, was also compared against other established WDC algorithms, among them Suffix Tree Clustering (STC) and Lingo. Results show a considerable improvement over the other algorithms, as measured by recall, F-measure, fall-out, accuracy and SSLk.
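
The interaction between a high-level selection heuristic, a set of low-level heuristics and a replacement strategy can be illustrated with a minimal sketch. This is not the WDC-HH implementation: the abstract does not give the scoring rule, the parameters or the solution encoding, so everything below is an assumption made for illustration. A toy sphere function stands in for the clustering fitness, two hypothetical mutation moves (small_step, large_step) stand in for the low-level heuristics, the reward of +1 per accepted solution is invented, and k-means refinement and the Balanced Bayesian Information Criterion are omitted. Only the general pattern is shown: performance-based roulette wheel selection combined with a replace-the-worst acceptance rule.

```python
import random


def roulette_select(scores, rng):
    """Return an index chosen with probability proportional to its score."""
    total = sum(scores)
    pick = rng.uniform(0, total)
    running = 0.0
    for i, s in enumerate(scores):
        running += s
        if pick <= running:
            return i
    return len(scores) - 1  # guard against floating-point round-off


def sphere(x):
    """Toy objective to minimise (stand-in for a clustering fitness)."""
    return sum(v * v for v in x)


def small_step(x, rng):
    """Hypothetical low-level heuristic: small Gaussian perturbation."""
    return [v + rng.gauss(0, 0.1) for v in x]


def large_step(x, rng):
    """Hypothetical low-level heuristic: large Gaussian perturbation."""
    return [v + rng.gauss(0, 1.0) for v in x]


def run_hyper_heuristic(iterations=200, pop_size=10, dim=3, seed=0):
    rng = random.Random(seed)
    low_level = [small_step, large_step]
    scores = [1.0] * len(low_level)  # assumed initial performance credit
    population = [[rng.uniform(-5, 5) for _ in range(dim)]
                  for _ in range(pop_size)]
    initial_best = min(sphere(p) for p in population)

    for _ in range(iterations):
        # High-level heuristic: performance-based roulette wheel selection.
        h = roulette_select(scores, rng)
        parent = rng.choice(population)
        child = low_level[h](parent, rng)

        # Replacement heuristic: replace the worst member if the child beats it.
        worst = max(range(pop_size), key=lambda i: sphere(population[i]))
        if sphere(child) < sphere(population[worst]):
            population[worst] = child
            scores[h] += 1.0  # assumed reward for a successful move

    final_best = min(sphere(p) for p in population)
    return initial_best, final_best, scores


if __name__ == "__main__":
    start, end, credit = run_hyper_heuristic()
    print(f"best fitness: {start:.4f} -> {end:.4f}; heuristic credit: {credit}")
```

In this sketch the credit list plays the role of the performance record consulted by the high-level heuristic. Replacing roulette_select with a rank-based, tabu or purely random chooser would give the other three selection styles named in the abstract, and replacing the acceptance test would give the other replacement strategies.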

Author information

Corresponding author

Correspondence to Carlos Cobos.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Cobos, C., Duque, A., Bolaños, J., Mendoza, M., León, E. (2017). Algorithm for Clustering of Web Search Results from a Hyper-heuristic Approach. In: Pichardo-Lagunas, O., Miranda-Jiménez, S. (eds) Advances in Soft Computing. MICAI 2016. Lecture Notes in Computer Science, vol 10062. Springer, Cham. https://doi.org/10.1007/978-3-319-62428-0_24

  • DOI: https://doi.org/10.1007/978-3-319-62428-0_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-62427-3

  • Online ISBN: 978-3-319-62428-0

  • eBook Packages: Computer Science, Computer Science (R0)
