DOI: 10.1145/3091478.3091500

Exploring Web Archives Through Temporal Anchor Texts

Published: 25 June 2017

Abstract

Web archives have been instrumental in the digital preservation of the Web and provide a great opportunity to study the societal past and its evolution. These archives are massive collections, typically on the order of terabytes to petabytes. As a result, search and exploration of archives has been limited, because full-text indexing is both resource-intensive and computationally expensive. We observe that typical access methods to archives, which are navigational and temporal in nature, do not always require a full-text index. Instead, meaningful text surrogates such as anchor texts already go a long way toward providing useful solutions and can act as reasonable entry points for exploring Web archives.
In this paper, we present a new approach to searching Web archives based on temporal link graphs and their corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, such as finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated into a data processing workflow that scales up to thousands of pages. In this analysis, we were able to replicate results reported by an offline study, showing that restaurant prices in Germany indeed increased when the Euro was introduced as Europe's currency.
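
To make the idea of searching by temporal anchor texts more concrete, the following is a minimal illustrative sketch and not the indexing method or retrieval model proposed in the paper. It assumes a hypothetical `AnchorLink` record (source URL, target URL, anchor text, capture year), builds a simple inverted index over anchor terms, and ranks target pages by the number of distinct source pages that linked to them with matching anchor text inside a queried period. All names, URLs, and the ranking rule are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnchorLink:
    """One hyperlink observed in an archived capture (hypothetical record)."""
    source: str   # URL of the page containing the link
    target: str   # URL the link points to
    anchor: str   # anchor text of the link
    year: int     # year of the capture in which the link was seen


def build_index(links):
    """Map each lowercased anchor term to the links whose anchor contains it."""
    index = defaultdict(list)
    for link in links:
        for term in set(link.anchor.lower().split()):
            index[term].append(link)
    return index


def search(index, query, start_year, end_year):
    """Rank targets by distinct in-period sources whose anchor has all query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    sources_per_target = defaultdict(set)
    # Candidates come from the posting list of the first term; the remaining
    # terms and the time period are checked per candidate.
    for link in index.get(terms[0], []):
        anchor_terms = set(link.anchor.lower().split())
        in_period = start_year <= link.year <= end_year
        if in_period and all(t in anchor_terms for t in terms):
            sources_per_target[link.target].add(link.source)
    # Assumed scoring: more distinct linking pages first, ties broken by URL.
    ranked = sorted(sources_per_target.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(target, len(sources)) for target, sources in ranked]


# Toy data: the same entity is linked at different URLs in different periods.
links = [
    AnchorLink("a.example", "old.example/home", "Example Corp", 2001),
    AnchorLink("b.example", "old.example/home", "example corp homepage", 2002),
    AnchorLink("c.example", "new.example", "Example Corp", 2008),
    AnchorLink("d.example", "other.example", "Other Corp", 2002),
]
index = build_index(links)
results_early = search(index, "example corp", 2000, 2003)
results_late = search(index, "example corp", 2006, 2010)
print(results_early)  # [('old.example/home', 2)]
print(results_late)   # [('new.example', 1)]
```

In this toy setting the same query returns a different entry page depending on the time period, which is the kind of temporal, navigational lookup the abstract describes. A real system would need a far more careful scoring function and a scalable index.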


Published In

WebSci '17: Proceedings of the 2017 ACM on Web Science Conference
June 2017
438 pages
ISBN:9781450348966
DOI:10.1145/3091478

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data analysis
  2. temporal information retrieval
  3. web archives

Qualifiers

  • Research-article

Funding Sources

  • European Research Council

Conference

WebSci '17: ACM Web Science Conference
June 25 - 28, 2017
Troy, New York, USA

Acceptance Rates

WebSci '17 Paper Acceptance Rate: 30 of 85 submissions, 35%
Overall Acceptance Rate: 245 of 933 submissions, 26%

Cited By

  • (2023) Detecting Phishing Websites: Leveraging RNNs and Domain-Specific Features for Enhanced Accuracy. 2023 International Conference on Computational Intelligence, Networks and Security (ICCINS), 01-07. DOI: 10.1109/ICCINS58907.2023.10450062. Online publication date: 22-Dec-2023
  • (2023) Is this news article still relevant? Ranking by contemporary relevance in archival search. International Journal on Digital Libraries 25:2, 197-216. DOI: 10.1007/s00799-023-00377-y. Online publication date: 28-Jul-2023
  • (2022) Extractive Explanations for Interpretable Text Ranking. ACM Transactions on Information Systems 41:4, 1-31. DOI: 10.1145/3576924. Online publication date: 16-Dec-2022
  • (2022) User Access Models to Event-Centric Information. Companion Proceedings of the Web Conference 2022, 329-333. DOI: 10.1145/3487553.3524193. Online publication date: 25-Apr-2022
  • (2021) A Semantic Layer Querying Tool. Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 1101-1104. DOI: 10.1145/3437963.3441710. Online publication date: 8-Mar-2021
  • (2021) Estimating Contemporary Relevance of Past News. 2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 70-79. DOI: 10.1109/JCDL52503.2021.00019. Online publication date: Sep-2021
  • (2021) Efficient Scalable Temporal Web Graph Store. 2021 IEEE International Conference on Big Data (Big Data), 263-273. DOI: 10.1109/BigData52589.2021.9671984. Online publication date: 15-Dec-2021
  • (2021) A Holistic View on Web Archives. The Past Web, 85-99. DOI: 10.1007/978-3-030-63291-5_8. Online publication date: 1-Jul-2021
  • (2019) Towards temporal URI collections for named entities. Proceedings of the 18th Joint Conference on Digital Libraries, 241-250. DOI: 10.1109/JCDL.2019.00-68. Online publication date: 2-Jun-2019
  • (2019) Estimating PageRank deviations in crawled graphs. Applied Network Science 4:1. DOI: 10.1007/s41109-019-0201-9. Online publication date: 22-Oct-2019
