Web Archive Profiling Through Fulltext Search

Alam, Sawood; Nelson, Michael L.; Van de Sompel, Herbert; Rosenthal, David S. H.

doi:10.1007/978-3-319-43997-6_10

Conference paper
First Online: 10 August 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Abstract

An archive profile is a high-level summary of a web archive’s holdings that can be used for routing Memento queries to the appropriate archives. It can be created by generating summaries from CDX files (the indexes of web archives), an approach we explored in earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means of discovering an archive’s holdings involve sampling-based approaches, such as issuing fulltext keyword searches to learn the URIs present in the responses, or looking up a sample set of URIs to see which of them are present in the archive. This paper focuses on fulltext-search-based discovery and profiling. We developed the Random Searcher Model (RSM) to discover the holdings of an archive through a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can make correct routing decisions for 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.
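The abstract does not spell out how the Random Searcher Model operates, so the following is only a minimal sketch of the general idea of a random search walk, not the authors' implementation. It assumes a toy in-memory archive (`ARCHIVE`), a hypothetical `fulltext_search` function standing in for an archive's search interface, a walk that draws its next keyword from the text of pages discovered so far, and a host-level profile key as one possible profiling policy. All of these names and choices are illustrative assumptions.

```python
import random
from urllib.parse import urlsplit

# Toy stand-in for an archive's fulltext index: URI -> page text (made-up data).
ARCHIVE = {
    "http://example.com/a": "web archive memento routing",
    "http://example.com/b": "archive profile summary holdings",
    "http://example.org/c": "random search walk holdings",
    "http://example.org/d": "keyword search cost profile",
    "http://example.net/e": "memento aggregator routing cost",
    "http://example.net/f": "web crawl summary keyword",
}


def fulltext_search(keyword):
    """Hypothetical search interface: URIs whose text contains the keyword."""
    return sorted(u for u, text in ARCHIVE.items() if keyword in text.split())


def random_search_walk(seed_word, target_fraction, max_queries=50, rng=None):
    """Discover URIs by repeatedly searching for a randomly chosen known word.

    Returns the discovered URIs and the number of queries issued (the cost).
    The stop condition uses the true archive size, which is only known in an
    experimental setting such as this toy one.
    """
    rng = rng or random.Random(0)
    discovered = set()
    vocabulary = [seed_word]  # words available for the next query
    queries = 0
    target = target_fraction * len(ARCHIVE)
    while queries < max_queries and len(discovered) < target:
        word = rng.choice(vocabulary)
        queries += 1
        for uri in fulltext_search(word):
            if uri in discovered:
                continue
            discovered.add(uri)
            # Assumption: the searcher learns new words from discovered pages.
            for w in ARCHIVE[uri].split():
                if w not in vocabulary:
                    vocabulary.append(w)
    return discovered, queries


def build_profile(uris):
    """Assumed profiling policy: summarize holdings by hostname only."""
    return {urlsplit(u).hostname for u in uris}


def would_route(profile, uri):
    """Route a lookup to the archive only if its hostname is in the profile."""
    return urlsplit(uri).hostname in profile


if __name__ == "__main__":
    found, cost = random_search_walk("archive", 0.5, rng=random.Random(42))
    profile = build_profile(found)
    print(f"discovered {len(found)} of {len(ARCHIVE)} URIs in {cost} queries")
    print("profile keys:", sorted(profile))
```

In this sketch the query count plays the role of the search cost, and the hostname set is a coarse profile: a lookup is routed to the archive only when its host appears in the profile, so a partial discovery can still answer routing questions for URIs that were never seen directly.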

References

Alam, S., Kreymer, I., Nelson, M.L.: Object Resource Stream (ORS) and CDX-JSON (CDXJ) Draft (2015). https://github.com/oduwsdl/ORS
Alam, S., Nelson, M.L., Van de Sompel, H., Balakireva, L.L., Shankar, H., Rosenthal, D.S.H.: Web archive profiling through CDX summarization. In: Kapidakis, S., et al. (eds.) TPDL 2015. LNCS, vol. 9316, pp. 3–14. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24592-8_1
AlSum, A., Weigle, M.C., Nelson, M.L., Van de Sompel, H.: Profiling web archive coverage for top-level domain and content language. Int. J. Digit. Libr. 14(3–4), 149–166 (2014)
Blum, A., Chan, T.H., Rwebangira, M.R.: A random-surfer web-graph model. In: Proceedings of the Meeting on Analytic Algorithmics and Combinatorics, ANALCO 2006, pp. 238–246. Society for Industrial and Applied Mathematics (2006)
Bornand, N., Balakireva, L., Van de Sompel, H.: Routing Memento requests using binary classifiers. In: Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016 (2016)
Egghe, L.: Untangling Herdan’s law and Heaps’ law: mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol. 58(5), 702–709 (2007)
Gravano, L., Chang, C.C.K., García-Molina, H., Paepcke, A.: STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec. 26(2), 207–218 (1997)
Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying heterogeneous information sources using source descriptions (1996)
Liu, L.: Query routing in large-scale digital library systems. In: Proceedings of 15th International Conference on Data Engineering, pp. 154–163 (1999)
Meng, W., Yu, C., Liu, K.L.: Building efficient and effective metasearch engines. ACM Comput. Surv. (CSUR) 34(1), 48–89 (2002)
Ntoulas, A., Zerfos, P., Cho, J.: Downloading textual hidden web content through keyword queries. In: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2005, pp. 100–109 (2005)
Raghavan, S., Garcia-Molina, H.: Crawling the hidden web. Technical report 2000-36, Stanford InfoLab (2000). http://ilpubs.stanford.edu:8090/456/
Sanderson, R.: Global web archive integration with Memento. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 379–380. ACM (2012)
Sanderson, R., Van de Sompel, H., Nelson, M.L.: IIPC Memento Aggregator Experiment (2012). http://www.netpreserve.org/sites/default/files/resources/Sanderson.pdf
Sugiura, A., Etzioni, O.: Query routing for web search engines: architecture and experiments. Comput. Netw. 33(1), 417–429 (2000)
Tran, T., Zhang, L.: Keyword query routing. IEEE Trans. Knowl. Data Eng. 26(2), 363–375 (2014)
Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP Framework for Time-Based Access to Resource States - Memento. RFC 7089 (2013)
Wu, P., Wen, J.R., Liu, H., Ma, W.Y.: Query selection techniques for efficient crawling of structured web sources. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, pp. 47–47 (2006)
Xu, J., Callan, J.: Effective retrieval with distributed collections. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 112–120. ACM (1998)

Acknowledgements

This work is supported in part by the International Internet Preservation Consortium (IIPC).

Author information

Authors and Affiliations

Computer Science Department, Old Dominion University, Norfolk, VA, USA
Sawood Alam & Michael L. Nelson
Los Alamos National Laboratory, Los Alamos, NM, USA
Herbert Van de Sompel
Stanford University Libraries, Stanford, CA, USA
David S. H. Rosenthal

Corresponding author

Correspondence to Sawood Alam.

Editor information

Editors and Affiliations

Universität Duisburg-Essen, Duisburg, Germany
Norbert Fuhr
Hungarian Academy of Science, Budapest, Hungary
László Kovács
Leibniz Universität Hannover, Hannover, Germany
Thomas Risse
Leibniz Universität Hannover, Hannover, Germany
Wolfgang Nejdl

Cite this paper

Alam, S., Nelson, M.L., Van de Sompel, H., Rosenthal, D.S.H. (2016). Web Archive Profiling Through Fulltext Search. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_10

DOI: https://doi.org/10.1007/978-3-319-43997-6_10
Published: 10 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)
