Abstract
We consider a variant of the “string searching in database” problem in which the string database arrives on a data stream, and processing the data is at a premium but querying is not a runtime bottleneck. Specifically, the strings to be searched (let us call them the documents) have to be processed online very efficiently, meaning that the documents have to be added to some string searching data structure one by one, each in time proportional to its length. Of course, we want this data structure to be small, i.e. to use at most linear space, and ideally to exhibit a tradeoff between storage/processing cost and accuracy. Given a query string, the data structure must return whether that string is contained in a document (the presence query), and must also be able to return a list of the documents that contain the query (the attribution query). We may require that the query be large enough, and we may allow that only portions of it match (pattern matching). In practice, it is acceptable for the data structure to return a superset of the answer, as long as no document from the answer is missing and there are only a few false positives; either the false positives can be filtered (by actual verification, if the document texts are available in a repository), or a small number of false positives is acceptable for the application (e.g. network forensics, see below).
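To make the two query types concrete, the following is a minimal sketch of one possible structure with these properties, assuming a Bloom filter over fixed-size substrings (blocks) of each document. This is an illustration under stated assumptions and not necessarily the construction used in the paper; the class name `StreamingSubstringIndex`, its parameters (`num_bits`, `num_hashes`, `block_size`), and the choice of BLAKE2b as hash function are all hypothetical.

```python
import hashlib


class StreamingSubstringIndex:
    """Illustrative sketch only: a Bloom filter over fixed-size blocks.

    All names and parameters here are assumptions made for this example.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=4, block_size=8):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.block_size = block_size
        self.bits = bytearray((num_bits + 7) // 8)
        self.doc_ids = []

    def _positions(self, item):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def _insert(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def _contains(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

    def _blocks(self, text):
        # Every substring of length block_size (sliding window).
        for start in range(len(text) - self.block_size + 1):
            yield text[start:start + self.block_size]

    def add_document(self, doc_id, text):
        # One pass over the document; for fixed parameters the cost is
        # proportional to len(text). Shorter documents contribute no blocks.
        self.doc_ids.append(doc_id)
        tag = doc_id.encode("utf-8") + b"\x00"
        for block in self._blocks(text):
            self._insert(block)        # supports the presence query
            self._insert(tag + block)  # supports the attribution query

    def _matches(self, tag, query):
        if len(query) < self.block_size:
            raise ValueError("query must be at least block_size bytes long")
        return all(self._contains(tag + block) for block in self._blocks(query))

    def present(self, query):
        # May report a false positive, never a false negative.
        return self._matches(b"", query)

    def attribute(self, query):
        # Returns a superset of the documents that contain the query.
        return [d for d in self.doc_ids
                if self._matches(d.encode("utf-8") + b"\x00", query)]
```

If a query occurs in a document, every block of the query is also a block of that document and was therefore inserted, so no matching document can be missed; a non-matching document is reported only when all of the query's blocks happen to pass the filter. In this sketch the attribution query tests every known document identifier, which pushes cost to query time, consistent with the assumption that querying is not the bottleneck.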
This research is supported by NSF CyberTrust Grant 0430444, “Fornet: Design and Implementation of a Network Forensics System”.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brönnimann, H., Memon, N., Shanmugasundaram, K. (2005). String Matching on the Internet. In: López-Ortiz, A., Hamel, A.M. (eds) Combinatorial and Algorithmic Aspects of Networking. CAAN 2004. Lecture Notes in Computer Science, vol 3405. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527954_8
DOI: https://doi.org/10.1007/11527954_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27873-3
Online ISBN: 978-3-540-31860-6
eBook Packages: Computer Science, Computer Science (R0)