Adaptive Focused Crawling of Linked Data

Yu, Ran; Gadiraju, Ujwal; Fetahu, Besnik; Dietze, Stefan

doi:10.1007/978-3-319-26190-4_37

Ran Yu²⁰,
Ujwal Gadiraju²⁰,
Besnik Fetahu²⁰ &
…
Stefan Dietze²⁰

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9418))

Included in the following conference series:

International Conference on Web Information Systems Engineering

1535 Accesses
3 Citations

Abstract

Given the evolution of publicly available Linked Data, crawling and preservation have become increasingly important challenges. Due to the scale of available data on the Web, efficient focused crawling approaches which are able to capture the relevant semantic neighborhood of seed entities are required. Here, determining relevant entities for a given set of seed entities is a crucial problem. While the weight of seeds within a seed list vary significantly with respect to the crawl intent, we argue that an adaptive crawler is required, which considers such characteristics when configuring the crawling and relevance detection approach. To address this problem, we introduce a crawling configuration, which considers seed list-specific features as part of its crawling and ranking algorithm. We evaluate it through extensive experiments in comparison to a number of baseline methods and crawling parameters. We demonstrate that, configurations which consider seed list features outperform the baselines and present further insights gained from our experiments.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
In this case we pooled the \(Top-500\) entities resulting from all the different configurations for each seed list.

References

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ISWC/ASWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007)
Chapter Google Scholar
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1), 107–117 (1998)
Article Google Scholar
Chakrabarti, S., Punera, K., Subramanyam, M.: Accelerated focused crawling through online relevance feedback. In: Proceedings of the 11th International Conference on World Wide Web, WWW, pp. 148–159. ACM, New York (2002)
Google Scholar
Chakrabarti, S., Van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. Comput. Netw. 31(11), 1623–1640 (1999)
Article Google Scholar
De Bra, P., Houben, G.-J., Kornatzky, Y., Post, R.: Information retrieval in distributed hypertexts. In: RIAO, pp. 481–493 (1994)
Google Scholar
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C.L., Gori, M., et al.: Focused crawling using context graphs. In: VLDB, pp. 527–534 (2000)
Google Scholar
Fetahu, B., Gadiraju, U., Dietze, S.: Crawl me maybe: iterative linked dataset preservation. In: Proceedings of the 13th International Semantic Web Conference (ISWC) Posters & Demonstrations Track, pp. 433–436 (2014)
Google Scholar
Fetahu, B., Gadiraju, U., Dietze, S.: Improving entity retrieval on structured data. In: Proceedings of the 14th International Semantic Web Conference. Springer (2015)
Google Scholar
Gadiraju, U., Demartini, G., Kawase, R., Dietze, S.: Human beyond the machine: challenges and opportunities of microtask crowdsourcing. IEEE Intell. Syst. 30(4), 81–85 (2015)
Article Google Scholar
Gadiraju, U., Kawase, R., Dietze, S., Demartini, G.: Understanding malicious behaviour in crowdsourcing platforms: the case of online surveys. In: Proceedings of CHI 2015 (2015)
Google Scholar
Isele, R., Umbrich, J., Bizer, C., Harth, A.: Ldspider: an open-source crawling framework for the web of linked data. In 9th International Semantic Web Conference, ISWC. Citeseer (2010)
Google Scholar
Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18(1), 39–43 (1953)
Article MATH Google Scholar
McCallumzy, A., Nigamy, K., Renniey, J., Seymorey, K.: Building domain-specific search engines with machine learning techniques (1999)
Google Scholar
Meusel, R., Mika, P., Blanco, R.: Focused crawling for structured data. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM, pp. 1039–1048 (2014)
Google Scholar
Pereira Nunes, B., Dietze, S., Casanova, M.A., Kawase, R., Fetahu, B., Nejdl, W.: Combining a co-occurrence-based and a semantic measure for entity linking. In: Cimiano, P., Corcho, O., Presutti, V., Hollink, L., Rudolph, S. (eds.) ESWC 2013. LNCS, vol. 7882, pp. 548–562. Springer, Heidelberg (2013)
Chapter Google Scholar
Pound, J., Mika, P., Zaragoza, H.: Ad-hoc object retrieval in the web of data. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, pp. 771–780. ACM (2010)
Google Scholar
Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, pp. 697–706. ACM (2007)
Google Scholar
Tang, T.T., Hawking, D., Craswell, N., Griffiths, K.: Focused crawling for both topical relevance and quality of medical information. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pp. 147–154. ACM (2005)
Google Scholar
Von Ahn, L., Dabbish, L.: Labeling images with a computer game. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 319–326. ACM (2004)
Google Scholar

Download references

Author information

Authors and Affiliations

L3S Research Center, Leibniz Universität Hannover, Hannover, Germany
Ran Yu, Ujwal Gadiraju, Besnik Fetahu & Stefan Dietze

Authors

Ran Yu
View author publications
You can also search for this author in PubMed Google Scholar
Ujwal Gadiraju
View author publications
You can also search for this author in PubMed Google Scholar
Besnik Fetahu
View author publications
You can also search for this author in PubMed Google Scholar
Stefan Dietze
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ran Yu .

Editor information

Editors and Affiliations

Tsinghua University, Bijing, China
Jianyong Wang
Poznan University of Economics, Poznan, Poland
Wojciech Cellary
Florida Atlantic University, Boca Raton, Florida, USA
Dingding Wang
Victoria University, Melbourne, Australia
Hua Wang
School of Computing & Information, Florida International University, Miami, Florida, USA
Shu-Ching Chen
Florida International University, Miami, Florida, USA
Tao Li
Victoria University, Melbourne, Victoria, Australia
Yanchun Zhang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Yu, R., Gadiraju, U., Fetahu, B., Dietze, S. (2015). Adaptive Focused Crawling of Linked Data. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science(), vol 9418. Springer, Cham. https://doi.org/10.1007/978-3-319-26190-4_37

Download citation

DOI: https://doi.org/10.1007/978-3-319-26190-4_37
Published: 25 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26189-8
Online ISBN: 978-3-319-26190-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics