Abstract
An archive profile is a high-level summary of a web archive’s holdings that can be used for routing Memento queries to the appropriate archives. It can be created by generating summaries from CDX files (the indexes of web archives), an approach we explored in earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means of discovering the holdings of an archive rely on sampling, such as issuing fulltext keyword searches to learn the URIs present in the responses, or looking up a sample set of URIs to see which of them are present in the archive. Fulltext-search-based discovery and profiling is the scope of this paper. We developed the Random Searcher Model (RSM) to discover the holdings of an archive by a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can route 80% of requests correctly while maintaining a recall of about 0.9 by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.
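The abstract does not spell out how the RSM operates, so the following is only a minimal sketch of what a random search walk over a fulltext search interface might look like. The toy `ARCHIVE` data, the `search` function, and the host-level `profile` summary are all hypothetical stand-ins chosen for illustration; they are not the authors' implementation, policies, or data.

```python
import random
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical toy archive: URI -> page text (illustrative data only).
ARCHIVE = {
    "http://example.com/": "web archive memento profile",
    "http://example.com/about": "archive routing query",
    "http://example.org/news": "query memento news",
    "http://example.net/blog": "news blog profile",
}


def search(keyword):
    """Stand-in for an archive's fulltext search: URIs whose text contains keyword."""
    return [uri for uri, text in ARCHIVE.items() if keyword in text.split()]


def random_search_walk(seed_word, max_queries, rng):
    """Discover URIs by repeatedly searching for a randomly chosen known word.

    Assumption: words found in returned pages feed the pool of future queries.
    Returns the set of discovered URIs and the number of queries issued (the cost).
    """
    discovered = set()
    vocabulary = {seed_word}
    used = set()
    queries = 0
    while queries < max_queries:
        candidates = sorted(vocabulary - used)
        if not candidates:
            break  # nothing new left to search for
        word = rng.choice(candidates)
        used.add(word)
        queries += 1
        for uri in search(word):
            discovered.add(uri)
            # Learn new query words from the text of each result.
            vocabulary.update(ARCHIVE[uri].split())
    return discovered, queries


def profile(uris):
    """One possible profiling policy (an assumption): count discovered URIs per host."""
    return Counter(urlsplit(uri).hostname for uri in uris)


if __name__ == "__main__":
    found, cost = random_search_walk("archive", max_queries=50, rng=random.Random(0))
    print(f"discovered {len(found)} of {len(ARCHIVE)} URIs using {cost} queries")
    print(dict(profile(found)))
```

In this sketch the number of queries stands in for the search cost, and the per-host counts stand in for a profile that a Memento aggregator could consult when deciding whether to route a request to the archive.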
References
Alam, S., Kreymer, I., Nelson, M.L.: Object Resource Stream (ORS) and CDX-JSON (CDXJ) Draft (2015). https://github.com/oduwsdl/ORS
Alam, S., Nelson, M.L., Van de Sompel, H., Balakireva, L.L., Shankar, H., Rosenthal, D.S.H.: Web archive profiling through CDX summarization. In: Kapidakis, S., et al. (eds.) TPDL 2015. LNCS, vol. 9316, pp. 3–14. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24592-8_1
AlSum, A., Weigle, M.C., Nelson, M.L., Van de Sompel, H.: Profiling web archive coverage for top-level domain and content language. Int. J. Digit. Libr. 14(3–4), 149–166 (2014)
Blum, A., Chan, T.H., Rwebangira, M.R.: A random-surfer web-graph model. In: Proceedings of the Meeting on Analytic Algorithmics and Combinatorics, ANALCO 2006, pp. 238–246. Society for Industrial and Applied Mathematics (2006)
Bornand, N., Balakireva, L., Van de Sompel, H.: Routing Memento requests using binary classifiers. In: Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2016 (2016)
Egghe, L.: Untangling Herdan’s law and Heaps’ law: mathematical and informetric arguments. J. Am. Soc. Inf. Sci. Technol. 58(5), 702–709 (2007)
Gravano, L., Chang, C.C.K., García-Molina, H., Paepcke, A.: STARTS: Stanford proposal for internet meta-searching. SIGMOD Rec. 26(2), 207–218 (1997)
Levy, A.Y., Rajaraman, A., Ordille, J.J.: Querying heterogeneous information sources using source descriptions (1996)
Liu, L.: Query routing in large-scale digital library systems. In: Proceedings of 15th International Conference on Data Engineering, pp. 154–163 (1999)
Meng, W., Yu, C., Liu, K.L.: Building efficient and effective metasearch engines. ACM Comput. Surv. (CSUR) 34(1), 48–89 (2002)
Ntoulas, A., Zerfos, P., Cho, J.: Downloading textual hidden web content through keyword queries. In: Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2005, pp. 100–109 (2005)
Raghavan, S., Garcia-Molina, H.: Crawling the hidden web. Technical report 2000-36, Stanford InfoLab (2000). http://ilpubs.stanford.edu:8090/456/
Sanderson, R.: Global web archive integration with Memento. In: Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 379–380. ACM (2012)
Sanderson, R., Van de Sompel, H., Nelson, M.L.: IIPC Memento Aggregator Experiment (2012). http://www.netpreserve.org/sites/default/files/resources/Sanderson.pdf
Sugiura, A., Etzioni, O.: Query routing for web search engines: architecture and experiments. Comput. Netw. 33(1), 417–429 (2000)
Tran, T., Zhang, L.: Keyword query routing. IEEE Trans. Knowl. Data Eng. 26(2), 363–375 (2014)
Van de Sompel, H., Nelson, M.L., Sanderson, R.: HTTP Framework for Time-Based Access to Resource States - Memento. RFC 7089 (2013)
Wu, P., Wen, J.R., Liu, H., Ma, W.Y.: Query selection techniques for efficient crawling of structured web sources. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, p. 47 (2006)
Xu, J., Callan, J.: Effective retrieval with distributed collections. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 112–120. ACM (1998)
Acknowledgements
This work is supported in part by the International Internet Preservation Consortium (IIPC).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Alam, S., Nelson, M.L., Van de Sompel, H., Rosenthal, D.S.H. (2016). Web Archive Profiling Through Fulltext Search. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_10
DOI: https://doi.org/10.1007/978-3-319-43997-6_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-43996-9
Online ISBN: 978-3-319-43997-6
eBook Packages: Computer Science, Computer Science (R0)