Algorithm for Clustering of Web Search Results from a Hyper-heuristic Approach

Cobos, Carlos; Duque, Andrea; Bolaños, Jamith; Mendoza, Martha; León, Elizabeth

doi:10.1007/978-3-319-62428-0_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10062)

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

Abstract

The clustering of web search results, also known as web document clustering (WDC), has become an active research area among the academic and scientific communities involved in information retrieval. Systems for clustering web search results, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review while reducing the time spent reviewing them. Several algorithms for clustering web results already exist, but reported results show that there is room for improvement. This paper introduces a hyper-heuristic framework called WDC-HH, which makes it possible to identify the best algorithm for WDC. The framework uses four high-level heuristics (performance-based rank selection, tabu selection, random selection, and performance-based roulette wheel selection) to select among low-level heuristics, which solve the specific problem of WDC. As low-level heuristics, the framework considers harmony search, improved harmony search, novel global harmony search, global-best harmony search, eighteen genetic algorithm variations, particle swarm optimization, artificial bee colony, and differential evolution. The framework uses the k-means algorithm as a local solution improvement strategy and, based on the Balanced Bayesian Information Criterion, automatically defines the appropriate number of clusters. It also uses four acceptance/replacement strategies (replacement heuristics): Replace the Worst, Restricted Competition Replacement, Stochastic Replacement, and Rank Replacement. WDC-HH was tested on four data sets comprising a total of 447 queries with their ideal solutions. The main result of the framework assessment is a new algorithm, based on global-best harmony search and the rank replacement strategy, that obtained the best results on the WDC problem. This new algorithm, called WDC-HH-BHRK, was also compared against other established WDC algorithms, among them Suffix Tree Clustering (STC) and Lingo. Results show a considerable improvement over the other algorithms, measured by recall, F-measure, fall-out, accuracy and SSL_k.
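The interplay between a high-level selection heuristic, a pool of low-level heuristics and a replacement strategy can be illustrated with a small sketch. The code below is not the WDC-HH implementation; it is a toy under stated assumptions. The low-level heuristics (`perturb`, `blend`, `restart`) are hypothetical stand-ins for the metaheuristics named in the abstract, the objective is a one-dimensional function to minimise rather than a clustering fitness, the reward rule (one credit per accepted candidate) is an assumption, and the k-means improvement step and the Balanced Bayesian Information Criterion are omitted. Only performance-based roulette wheel selection and the "replace the worst" strategy are shown.

```python
import random


def roulette_select(scores, rng):
    """Roulette wheel: return index i with probability proportional to scores[i]."""
    total = sum(scores)
    pick = rng.uniform(0.0, total)
    acc = 0.0
    for i, s in enumerate(scores):
        acc += s
        if pick < acc:
            return i
    return len(scores) - 1  # guard against floating-point edge cases


def replace_worst(population, fitness, candidate):
    """'Replace the worst': accept the candidate only if it beats the worst member."""
    worst = max(range(len(population)), key=lambda i: fitness(population[i]))
    if fitness(candidate) < fitness(population[worst]):
        population[worst] = candidate
        return True
    return False


# Hypothetical low-level heuristics (toy stand-ins, not the paper's algorithms).
def perturb(population, rng):
    return rng.choice(population) + rng.gauss(0.0, 0.5)


def blend(population, rng):
    a, b = rng.sample(population, 2)
    return (a + b) / 2.0


def restart(population, rng):
    return rng.uniform(-10.0, 10.0)


LOW_LEVEL = [perturb, blend, restart]


def sphere(x):
    """Toy objective to minimise; a real WDC fitness would score a clustering."""
    return x * x


def hyper_heuristic(low_level, fitness, population, iterations, seed=0):
    rng = random.Random(seed)
    scores = [1.0] * len(low_level)  # every heuristic starts with equal credit
    for _ in range(iterations):
        h = roulette_select(scores, rng)           # high-level choice
        candidate = low_level[h](population, rng)  # low-level move
        if replace_worst(population, fitness, candidate):
            scores[h] += 1.0  # assumed reward: one credit per accepted candidate
    return min(population, key=fitness), scores


if __name__ == "__main__":
    best, scores = hyper_heuristic(LOW_LEVEL, sphere, [5.0, -3.0, 8.0, 2.5], 200)
    print("best solution:", best)
    print("heuristic credits:", scores)
```

Because a candidate can only displace the worst member, the best solution in the population never gets worse from one iteration to the next. The credit vector gives a rough picture of which low-level heuristic contributed most, which is the kind of evidence a framework of this type could use to single out one combination as the best algorithm.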

Author information

Authors and Affiliations

Universidad del Cauca, Popayán, Colombia
Carlos Cobos, Andrea Duque, Jamith Bolaños & Martha Mendoza
Universidad Nacional de Colombia, Bogotá D.C., Colombia
Elizabeth León

Corresponding author

Correspondence to Carlos Cobos.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Unidad Profesional Interdisciplinaria en Ingeniería y Tecnologías Avanzadas, México DF, Mexico
Obdulia Pichardo-Lagunas
INFOTEC Aguascalientes, Aguascalientes, Mexico
Sabino Miranda-Jiménez

About this paper

Cite this paper

Cobos, C., Duque, A., Bolaños, J., Mendoza, M., León, E. (2017). Algorithm for Clustering of Web Search Results from a Hyper-heuristic Approach. In: Pichardo-Lagunas, O., Miranda-Jiménez, S. (eds) Advances in Soft Computing. MICAI 2016. Lecture Notes in Computer Science, vol 10062. Springer, Cham. https://doi.org/10.1007/978-3-319-62428-0_24

DOI: https://doi.org/10.1007/978-3-319-62428-0_24
Published: 02 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-62427-3
Online ISBN: 978-3-319-62428-0
eBook Packages: Computer Science, Computer Science (R0)
