Abstract
The clustering of web search results, also known as web document clustering (WDC), has become an active research area among the academic and scientific communities involved in information retrieval. Systems that cluster web search results, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review while reducing the time spent reviewing them. Several algorithms for clustering web results already exist, but reported results show there is room for improvement. This paper introduces a hyper-heuristic framework called WDC-HH, which makes it possible to identify the best algorithm for WDC. The framework uses four high-level heuristics (performance-based rank selection, tabu selection, random selection and performance-based roulette wheel selection) to select among low-level heuristics, which solve the specific problem of WDC. As low-level heuristics the framework considers harmony search, improved harmony search, novel global harmony search, global-best harmony search, eighteen genetic algorithm variations, particle swarm optimization, artificial bee colony, and differential evolution. The framework uses the k-means algorithm as a local solution improvement strategy and, based on the Balanced Bayesian Information Criterion, automatically defines the appropriate number of clusters. It also uses four acceptance/replacement strategies (replacement heuristics): Replace the Worst, Restricted Competition Replacement, Stochastic Replacement and Rank Replacement. WDC-HH was tested on four data sets comprising a total of 447 queries with their ideal solutions. The main result of the framework assessment is a new algorithm, based on global-best harmony search and the rank replacement strategy, that obtained the best results on the WDC problem. This new algorithm, called WDC-HH-BHRK, was also compared against other established WDC algorithms, among them Suffix Tree Clustering (STC) and Lingo. Results show a considerable improvement over the other algorithms, measured by recall, F-measure, fall-out, accuracy and SSLk.
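The interplay between a high-level selection heuristic, the low-level heuristics and a replacement strategy can be illustrated with a minimal sketch. It is not the WDC-HH implementation: it assumes a toy one-dimensional minimization problem in place of WDC, two hypothetical low-level heuristics (`small_step`, `random_restart`) in place of the harmony search, genetic and swarm variants, a simple reward counter as the performance measure, and only two of the named strategies (performance-based roulette wheel selection and Replace the Worst). All function and parameter names are illustrative.

```python
import random


def roulette_select(scores, rng):
    """Pick an index with probability proportional to its (positive) score."""
    pick = rng.uniform(0, sum(scores))
    running = 0.0
    for i, s in enumerate(scores):
        running += s
        if pick <= running:
            return i
    return len(scores) - 1  # guard against floating-point round-off


def replace_worst(population, candidate, fitness):
    """Replace the worst member if the candidate is better (lower is better)."""
    worst = max(range(len(population)), key=lambda i: fitness(population[i]))
    if fitness(candidate) < fitness(population[worst]):
        population[worst] = candidate
        return True
    return False


def hyper_heuristic(low_level, population, fitness, iterations, seed=0):
    """Repeatedly choose a low-level heuristic and reward the successful ones."""
    rng = random.Random(seed)
    scores = [1.0] * len(low_level)  # assumed reward scheme, one score each
    for _ in range(iterations):
        h = roulette_select(scores, rng)
        candidate = low_level[h](population, rng)
        if replace_worst(population, candidate, fitness):
            scores[h] += 1.0  # accepted solutions raise the selection odds
    return min(population, key=fitness), scores


# Hypothetical low-level heuristics for the toy problem (not from the paper).
def small_step(population, rng):
    return rng.choice(population) + rng.uniform(-0.1, 0.1)


def random_restart(population, rng):
    return rng.uniform(-10.0, 10.0)


def square(x):
    """Toy fitness; the framework itself relies on a clustering criterion."""
    return x * x


best, scores = hyper_heuristic(
    [small_step, random_restart], [5.0, -7.0, 9.0], square, iterations=200
)
print(best, scores)
```

In this sketch a heuristic earns credit only when its candidate is accepted into the population, so the roulette wheel gradually favors whichever heuristic is currently producing improvements.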
© 2017 Springer International Publishing AG
Cite this paper
Cobos, C., Duque, A., Bolaños, J., Mendoza, M., León, E. (2017). Algorithm for Clustering of Web Search Results from a Hyper-heuristic Approach. In: Pichardo-Lagunas, O., Miranda-Jiménez, S. (eds) Advances in Soft Computing. MICAI 2016. Lecture Notes in Computer Science, vol 10062. Springer, Cham. https://doi.org/10.1007/978-3-319-62428-0_24
DOI: https://doi.org/10.1007/978-3-319-62428-0_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62427-3
Online ISBN: 978-3-319-62428-0
eBook Packages: Computer Science (R0)