Abstract
Applications that involve streams of documents require a mechanism for search over the newest arrivals. In this paper we explore how to provide immediate indexing and fast search of recent documents only, in contrast to a focus on dynamic construction of an index of all observed material. Our contribution is a new structure, an apoptosic index, that operates in a fixed volume of memory and in which expired index entries vanish without significant overhead; there is neither explicit removal of old data nor explicit memory management. We demonstrate the practicality of apoptosic indexes with a straightforward implementation and experiments on microblog and newswire data, showing dramatically faster performance than observed with alternatives.
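The abstract does not give the layout of an apoptosic index, so the following is only a toy illustration of the general idea it states: postings live in a fixed-size array, and old entries disappear because they are overwritten, not because anything deletes them. The class name `RingIndex`, its methods, and the chaining scheme are all assumptions made for this sketch and should not be read as the authors' structure. One further simplification: the term dictionary here is an ordinary Python `dict`, which is not bounded in size.

```python
class RingIndex:
    """Toy fixed-capacity inverted index (hypothetical; not the paper's design).

    Postings are written into a circular array. Each posting stores the
    sequence number of the previous posting for the same term, so a term's
    postings form a chain from newest to oldest. A chain link is followed
    only while it points at a slot that has not yet been overwritten.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.doc = [None] * capacity   # document id held in each slot
        self.prev = [-1] * capacity    # sequence number of the term's previous posting
        self.head = {}                 # term -> sequence number of its newest posting
        self.next_seq = 0              # total number of postings written so far

    def add(self, doc_id, terms):
        """Index a document immediately; may overwrite the oldest postings."""
        for term in sorted(set(terms)):
            seq = self.next_seq
            slot = seq % self.capacity
            self.doc[slot] = doc_id
            self.prev[slot] = self.head.get(term, -1)
            self.head[term] = seq
            self.next_seq += 1

    def _live(self, seq):
        # A posting survives only if fewer than `capacity` postings
        # have been written since it was stored.
        return seq >= 0 and seq >= self.next_seq - self.capacity

    def search(self, term):
        """Return ids of recent documents containing `term`, newest first."""
        results = []
        seq = self.head.get(term, -1)
        while self._live(seq):
            slot = seq % self.capacity
            results.append(self.doc[slot])
            seq = self.prev[slot]
        return results
```

In this sketch, expiry costs nothing beyond the comparison in `_live`: a search simply stops following a chain once it reaches a posting that has been overwritten, and no removal or memory-management step ever runs.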
J. Zobel—This research was supported by the Australian Government through the Australian Research Council’s Discovery Projects funding scheme (project DP190102078). The views expressed herein are those of the authors and are not necessarily those of the Australian Government or Australian Research Council.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Eades, P., Wirth, A., Zobel, J. (2022). Immediate Text Search on Streams Using Apoptosic Indexes. In: Hagen, M., et al. Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13185. Springer, Cham. https://doi.org/10.1007/978-3-030-99736-6_11
DOI: https://doi.org/10.1007/978-3-030-99736-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-99735-9
Online ISBN: 978-3-030-99736-6
eBook Packages: Computer Science (R0)