ABC Algorithm for URL Extraction

Sanagavarapu, Lalit Mohan; Sarangi, Sourav; Reddy, Y. Raghu

doi:10.1007/978-3-319-74433-9_12

Lalit Mohan Sanagavarapu (ORCID: orcid.org/0000-0003-0745-1042),
Sourav Sarangi &
Y. Raghu Reddy

Conference paper
First Online: 22 February 2018

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10544)

Abstract

Seed URLs, content classification, indexing, and ranking are key factors in the relevance of search results. Domain-specific search engines (DSSEs) provide more relevant search results because they face fewer ambiguity issues. Wide adoption of DSSEs requires identifying seed URLs and their related child URLs. Identification of seed URLs has so far been manual, which lengthens the time needed to build a DSSE or to decide whether enough URLs are available for one. We propose using the nature-inspired Artificial Bee Colony (ABC) algorithm to identify and score seed and child URLs. We implemented the algorithm for the ‘Security’ domain, extracting 34,007 seed URLs from a Wikipedia data dump and 323,488 child URLs from those seed URLs. Based on the volume and relevance of the extracted URLs, a decision on building a DSSE can be made easily.
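
The abstract does not spell out how the bee colony scores URLs, so the following is only a minimal sketch of a generic ABC-style loop applied to URL scoring, not the authors' implementation. Everything in it is an assumption made for illustration: the toy in-memory link graph `PAGES`, the keyword set `KEYWORDS`, the keyword-overlap `fitness` rule, and the `abc_extract` function with its `cycles` and `limit` parameters. Employed bees follow child links from their current URL, onlookers favour higher-scoring URLs, and scouts abandon URLs that stop yielding better children.

```python
import random

# Hypothetical toy "web": URL -> (page text, outgoing links). Not from the paper.
PAGES = {
    "a": ("firewall malware encryption", ["b", "c"]),
    "b": ("malware phishing", ["d"]),
    "c": ("cooking recipes", ["d"]),
    "d": ("encryption firewall phishing malware", []),
}
# Hypothetical domain vocabulary for a 'Security' search engine.
KEYWORDS = {"firewall", "malware", "encryption", "phishing"}


def fitness(url):
    """Assumed scoring rule: fraction of domain keywords found in the page text."""
    words = set(PAGES[url][0].split())
    return len(words & KEYWORDS) / len(KEYWORDS)


def abc_extract(seeds, cycles=20, limit=3, rng=None):
    """Return a dict of every URL scored while the colony explored from the seeds."""
    rng = rng or random.Random(0)
    sources = list(seeds)                      # one food source (URL) per employed bee
    trials = [0] * len(sources)                # failed improvement attempts per source
    scores = {u: fitness(u) for u in sources}  # all URLs scored so far

    def explore(i):
        # Try one child link of source i; move there only if it scores higher.
        links = PAGES[sources[i]][1]
        if not links:
            trials[i] += 1
            return
        candidate = rng.choice(links)
        scores[candidate] = fitness(candidate)
        if scores[candidate] > scores[sources[i]]:
            sources[i], trials[i] = candidate, 0
        else:
            trials[i] += 1

    for _ in range(cycles):
        # Employed-bee phase: every source is explored once.
        for i in range(len(sources)):
            explore(i)
        # Onlooker phase: sources are picked with probability proportional to fitness.
        weights = [scores[u] + 1e-9 for u in sources]
        for _ in range(len(sources)):
            i = rng.choices(range(len(sources)), weights=weights)[0]
            explore(i)
        # Scout phase: abandon exhausted sources and jump to a random known page.
        for i in range(len(sources)):
            if trials[i] > limit:
                sources[i] = rng.choice(list(PAGES))
                scores[sources[i]] = fitness(sources[i])
                trials[i] = 0
    return scores
```

Calling `abc_extract(["a"])` returns a mapping from each visited URL to its keyword-overlap score. In a real crawler the lookup in `PAGES` would be replaced by fetching and parsing pages, and the scoring rule by whatever relevance measure the domain calls for.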

Author information

Authors and Affiliations

International Institute of Information Technology, Gachibowli, Hyderabad, India
Lalit Mohan Sanagavarapu, Sourav Sarangi & Y. Raghu Reddy

Corresponding author

Correspondence to Lalit Mohan Sanagavarapu.

Editor information

Editors and Affiliations

Universidad de Alicante, Alicante, Spain
Irene Garrigós
Institute of Software Technology and Interactive Systems, TU Wien, Vienna, Austria
Manuel Wimmer

About this paper

Cite this paper

Sanagavarapu, L.M., Sarangi, S., Reddy, Y.R. (2018). ABC Algorithm for URL Extraction. In: Garrigós, I., Wimmer, M. (eds) Current Trends in Web Engineering. ICWE 2017. Lecture Notes in Computer Science, vol 10544. Springer, Cham. https://doi.org/10.1007/978-3-319-74433-9_12

DOI: https://doi.org/10.1007/978-3-319-74433-9_12
Published: 22 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-74432-2
Online ISBN: 978-3-319-74433-9
eBook Packages: Computer Science, Computer Science (R0)
