Web Archive Profiling Through Fulltext Search

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9819)

Abstract

An archive profile is a high-level summary of a web archive’s holdings that can be used for routing Memento queries to the appropriate archives. It can be created by generating summaries from CDX files (the indexes of web archives), an approach we explored in earlier work. However, requiring archives to update their profiles periodically is difficult. Alternative means of discovering the holdings of an archive involve sampling-based approaches, such as fulltext keyword searching to learn the URIs present in the responses, or looking up a sample set of URIs to see which of them are present in the archive. Fulltext-search-based discovery and profiling is the scope of this paper. We developed the Random Searcher Model (RSM) to discover the holdings of an archive by a random search walk. We measured the search cost of discovering certain percentages of the archive holdings for various profiling policies under different RSM configurations. We can make routing decisions for 80% of the requests correctly while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.
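To make the idea of a random search walk concrete, the following is a minimal toy sketch in Python. It is not the authors' RSM implementation, whose details are not given in this abstract. The in-memory `ARCHIVE`, the `search` helper, the seed word, and the stopping rule are all hypothetical stand-ins chosen for illustration. The sketch also assumes the total archive size is known, so that the walk can stop at a target percentage, as it would be in an experiment with ground truth.

```python
import random

# Hypothetical toy archive: URI -> page text (stands in for a fulltext index).
ARCHIVE = {
    "http://example.com/news/a": "election results city council",
    "http://example.com/news/b": "city budget vote council",
    "http://example.org/blog/1": "budget travel tips europe",
    "http://example.org/blog/2": "europe train travel city",
    "http://example.net/about": "contact team history",
}


def search(keyword):
    """Return the URIs whose text contains the keyword as a whole word."""
    return {uri for uri, text in ARCHIVE.items() if keyword in text.split()}


def random_search_walk(seed_word, target_percent, max_queries, rng):
    """Discover URIs by repeatedly searching for randomly chosen words.

    New candidate words are harvested from the text of discovered pages.
    Returns (discovered URIs, number of queries issued).
    Assumption: the archive size is known, so a target percentage can be checked.
    """
    discovered = set()
    vocabulary = {seed_word}
    used = set()
    queries = 0
    while (queries < max_queries
           and len(discovered) * 100 < target_percent * len(ARCHIVE)):
        candidates = sorted(vocabulary - used)
        if not candidates:
            break  # the walk is stuck: no untried words are left
        word = rng.choice(candidates)
        used.add(word)
        queries += 1  # each search counts as one unit of search cost
        for uri in search(word):
            discovered.add(uri)
            vocabulary.update(ARCHIVE[uri].split())
    return discovered, queries


found, cost = random_search_walk("city", 80, 20, random.Random(0))
print(f"discovered {len(found)} of {len(ARCHIVE)} URIs in {cost} queries")
```

In this toy data, one page shares no words with the rest, so a walk seeded with "city" can never reach it. This illustrates why the number of queries needed to reach a given coverage level is worth measuring.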

Notes

  1. http://labs.mementoweb.org/aggregator_config/archivelist.xml

  2. http://timetravel.mementoweb.org/

  3. http://oldweb.today/

  4. http://archive.org/web/researcher/cdx_file_format.php

  5. https://www.archive-it.org/collections/194

  6. http://worddetail.org/most_common/nouns

  7. https://github.com/oduwsdl/archive_profiler

Acknowledgements

This work is supported in part by the International Internet Preservation Consortium (IIPC).

Author information

Corresponding author

Correspondence to Sawood Alam.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Alam, S., Nelson, M.L., Van de Sompel, H., Rosenthal, D.S.H. (2016). Web Archive Profiling Through Fulltext Search. In: Fuhr, N., Kovács, L., Risse, T., Nejdl, W. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2016. Lecture Notes in Computer Science, vol 9819. Springer, Cham. https://doi.org/10.1007/978-3-319-43997-6_10

  • DOI: https://doi.org/10.1007/978-3-319-43997-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-43996-9

  • Online ISBN: 978-3-319-43997-6

  • eBook Packages: Computer Science, Computer Science (R0)
