Abstract
String similarity search is required by many real-life applications, such as spell checking, data cleansing, fuzzy keyword search, or comparison of DNA sequences. Given a very large string set and a query string, the string similarity search problem is to efficiently find all strings in the string set that are similar to the query string. Similarity is defined using a similarity (or distance) measure, such as edit distance or Hamming distance. In this paper, we introduce the State Set Index (SSI) as an efficient solution for this search problem.
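To make the problem statement concrete, the following minimal sketch shows the unindexed baseline: scan the whole string set and keep every string whose distance to the query is within a threshold. It is not the paper's method. The function names `edit_distance`, `hamming_distance`, and `similarity_search`, the threshold parameter `k`, and the sample names are assumptions made for illustration.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def hamming_distance(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def similarity_search(strings: Iterable[str], query: str, k: int) -> List[str]:
    """Brute-force baseline: one distance computation per stored string."""
    return [s for s in strings if edit_distance(s, query) <= k]


# Hypothetical sample data, not taken from the paper's evaluation.
names = ["meier", "meyer", "maier", "mueller"]
matches = similarity_search(names, "meier", 1)
```

The scan performs one distance computation per stored string, which is the cost an index is meant to avoid on very large string sets.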
SSI is based on a trie (prefix index) that is interpreted as a nondeterministic finite automaton. SSI implements a novel state labeling strategy that makes the index highly space-efficient. Furthermore, SSI’s space consumption can be gracefully traded against search time.
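As a rough illustration of why a trie helps, the sketch below runs a generic edit-distance search over a plain trie, treating each node as a state and each child edge as a transition. It does not reproduce SSI's state labeling strategy or its space/time trade-off, which the abstract does not detail. `TrieNode`, `build_trie`, `trie_search`, and the threshold `k` are hypothetical names chosen for this example.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One trie node, viewed as an automaton state."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True if a stored string ends at this node


def build_trie(strings: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def trie_search(root: TrieNode, query: str, k: int) -> List[str]:
    """Return all stored strings within edit distance k of the query."""
    results: List[str] = []

    def visit(node: TrieNode, prefix: str, prev_row: List[int]) -> None:
        # prev_row[j] is the edit distance between prefix and query[:j].
        if node.is_word and prev_row[-1] <= k:
            results.append(prefix)
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,            # insertion
                               prev_row[j] + 1,           # deletion
                               prev_row[j - 1] + cost))   # match or substitute
            # The row minimum never decreases along a path, so a subtree
            # whose row is entirely above k cannot contain a match.
            if min(row) <= k:
                visit(child, prefix + ch, row)

    visit(root, "", list(range(len(query) + 1)))
    return results


# Hypothetical sample data, not taken from the paper's evaluation.
trie = build_trie(["meier", "meyer", "maier", "mueller"])
found = sorted(trie_search(trie, "meier", 1))
```

Strings with a common prefix share the dynamic-programming rows computed for that prefix, and whole subtrees are skipped once every entry in a row exceeds `k`.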
We evaluated SSI on different sets of person names with up to 170 million strings from a social network and compared it to other state-of-the-art methods. We show that in the majority of cases, SSI is significantly faster than other tools and requires less index space.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fenz, D., Lange, D., Rheinländer, A., Naumann, F., Leser, U. (2012). Efficient Similarity Search in Very Large String Sets. In: Ailamaki, A., Bowers, S. (eds) Scientific and Statistical Database Management. SSDBM 2012. Lecture Notes in Computer Science, vol 7338. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31235-9_18
DOI: https://doi.org/10.1007/978-3-642-31235-9_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-31234-2
Online ISBN: 978-3-642-31235-9
eBook Packages: Computer Science (R0)