Immediate Text Search on Streams Using Apoptosic Indexes

  • Conference paper
Advances in Information Retrieval (ECIR 2022)

Abstract

Applications that involve streams of documents require a mechanism for search over the newest arrivals. In this paper we explore the provision of immediate indexing and fast search of recent documents only, in contrast to the usual focus on dynamic construction of an index of all observed material. Our contribution is a new structure, an apoptosic index, that operates in a fixed volume of memory and in which expired index entries vanish without significant overhead; there is neither explicit removal of old data nor explicit memory management. We demonstrate the practicality of apoptosic indexes with a straightforward implementation and experiments on microblog and newswire data, showing dramatically faster performance than observed with alternatives.
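The abstract does not spell out how the structure works, so the following is only an illustrative sketch of the general idea of a fixed-memory index in which old entries disappear by being overwritten. It is not the authors' data structure. The class name `RingIndex`, the circular slot array, the hashed bucket heads, and the whitespace tokenizer are all assumptions made for this example.

```python
class RingIndex:
    """Toy fixed-memory index (hypothetical, not the paper's structure).

    Postings live in a circular array; a new posting overwrites the oldest
    one, so expired entries vanish without any explicit deletion step.
    """

    def __init__(self, capacity, buckets=1024):
        self.capacity = capacity
        self.terms = [None] * capacity  # term stored in each slot
        self.docs = [None] * capacity   # document id stored in each slot
        self.prev = [-1] * capacity     # sequence number of the previous posting in the same bucket
        self.heads = [-1] * buckets     # sequence number of the newest posting per hash bucket
        self.next_seq = 0               # total number of postings written so far

    def _live(self, seq):
        # A posting is live only while its slot has not been overwritten.
        return seq >= 0 and seq >= self.next_seq - self.capacity

    def add(self, doc_id, text):
        # Assumed tokenizer: lowercase, split on whitespace, one posting per distinct term.
        for term in set(text.lower().split()):
            bucket = hash(term) % len(self.heads)
            slot = self.next_seq % self.capacity
            self.terms[slot] = term
            self.docs[slot] = doc_id
            self.prev[slot] = self.heads[bucket]
            self.heads[bucket] = self.next_seq
            self.next_seq += 1

    def search(self, term):
        # Walk the bucket chain from newest to oldest and stop at the first
        # overwritten posting; everything older has been overwritten as well.
        term = term.lower()
        seq = self.heads[hash(term) % len(self.heads)]
        hits = []
        while self._live(seq):
            slot = seq % self.capacity
            if self.terms[slot] == term:  # skip other terms sharing the bucket
                hits.append(self.docs[slot])
            seq = self.prev[slot]
        return hits
```

In this sketch a document is searchable as soon as `add` returns, and memory use is fixed by `capacity` and `buckets` at construction time. Expiry here is driven by posting count rather than by time, which is a simplification chosen for brevity.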

J. Zobel—This research was supported by the Australian Government through the Australian Research Council’s Discovery Projects funding scheme (project DP190102078). The views expressed herein are those of the authors and are not necessarily those of the Australian Government or Australian Research Council.

Author information

Corresponding author

Correspondence to Patrick Eades.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Eades, P., Wirth, A., Zobel, J. (2022). Immediate Text Search on Streams Using Apoptosic Indexes. In: Hagen, M., et al. Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13185. Springer, Cham. https://doi.org/10.1007/978-3-030-99736-6_11

  • DOI: https://doi.org/10.1007/978-3-030-99736-6_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-99735-9

  • Online ISBN: 978-3-030-99736-6

  • eBook Packages: Computer Science, Computer Science (R0)
