Abstract
Seed URLs, content classification, indexing, and ranking are key factors in search result relevance. Domain-specific search engines (DSSEs) provide more relevant results because they face fewer ambiguity issues. Wider use of DSSEs requires identifying seed URLs and their related child URLs. Seed URL identification has so far been manual, which lengthens the time needed to build a DSSE or to decide whether enough URLs are available for one. We propose using the nature-inspired Artificial Bee Colony (ABC) algorithm to identify and score seed and child URLs. We implemented the algorithm for the 'Security' domain, extracting 34,007 seed URLs from a Wikipedia data dump and 323,488 child URLs from those seed URLs. Based on the volume and relevance of the extracted URLs, a decision on whether to build a DSSE can be made easily.
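To make the idea of bee-colony-style URL scoring concrete, the following is a minimal sketch of the generic ABC loop (employed, onlooker, and scout phases) applied to a toy in-memory link graph. It is not the paper's implementation: the `PAGES` data, the `KEYWORDS` set, the keyword-overlap `fitness` function, and the parameters `n_sources`, `cycles`, and `limit` are all assumptions introduced for illustration.

```python
import random

# Hypothetical toy data: page text and out-links standing in for crawled pages.
PAGES = {
    "a": ("firewall malware encryption", ["b", "c"]),
    "b": ("malware phishing", ["a", "d"]),
    "c": ("cooking recipes", ["d"]),
    "d": ("encryption firewall phishing malware", ["a"]),
    "e": ("football scores", ["c"]),
}
KEYWORDS = {"firewall", "malware", "encryption", "phishing"}  # assumed domain terms


def fitness(url):
    """Assumed relevance score: fraction of domain keywords present in the page."""
    words = set(PAGES[url][0].split())
    return len(words & KEYWORDS) / len(KEYWORDS)


def abc_search(n_sources=3, cycles=20, limit=3, seed=0):
    """Return every visited URL with its score, highest score first."""
    rng = random.Random(seed)
    urls = sorted(PAGES)
    sources = rng.sample(urls, n_sources)   # initial food sources (candidate seeds)
    trials = [0] * n_sources                # failed improvement attempts per source
    scored = {s: fitness(s) for s in sources}

    def explore(i):
        # Neighbour move: follow one random out-link (a child URL) of source i.
        candidate = rng.choice(PAGES[sources[i]][1])
        scored[candidate] = fitness(candidate)
        if scored[candidate] > fitness(sources[i]):
            sources[i], trials[i] = candidate, 0   # greedy replacement
        else:
            trials[i] += 1

    for _ in range(cycles):
        # Employed bees: each source is explored once.
        for i in range(n_sources):
            explore(i)
        # Onlooker bees: pick sources with probability proportional to fitness.
        weights = [fitness(s) + 1e-9 for s in sources]
        for _ in range(n_sources):
            explore(rng.choices(range(n_sources), weights=weights)[0])
        # Scout bees: abandon sources that stopped improving and restart randomly.
        for i in range(n_sources):
            if trials[i] > limit:
                sources[i], trials[i] = rng.choice(urls), 0
                scored[sources[i]] = fitness(sources[i])

    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for url, score in abc_search():
        print(url, score)
```

In this sketch the ranked output plays the role of a scored URL list: the number of visited URLs and their scores are the kind of volume and relevance signal the abstract refers to, although the real system works on a Wikipedia data dump and crawled pages rather than a hard-coded dictionary.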
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Sanagavarapu, L.M., Sarangi, S., Reddy, Y.R. (2018). ABC Algorithm for URL Extraction. In: Garrigós, I., Wimmer, M. (eds) Current Trends in Web Engineering. ICWE 2017. Lecture Notes in Computer Science, vol 10544. Springer, Cham. https://doi.org/10.1007/978-3-319-74433-9_12
DOI: https://doi.org/10.1007/978-3-319-74433-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-74432-2
Online ISBN: 978-3-319-74433-9
eBook Packages: Computer Science, Computer Science (R0)