Indexing a Dictionary for Subset Matching Queries

Landau, Gad M.; Tsur, Dekel; Weimann, Oren

doi:10.1007/978-3-540-75530-2_18

Gad M. Landau^1,2,
Dekel Tsur³ &
Oren Weimann⁴

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 4726))

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

548 Accesses

Abstract

We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ. Our goal is to preprocess D so when given a query pattern p, where each location in p contains a single character from Σ, we answer if p matches to D. p is said to match to D if there is some s ∈ D where |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|.

To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of \(O(nm +|\Sigma|^k n \lg (\min\{n,m\}))\), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer and the second construction uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain \(O(nm+|\Sigma|^k n +|\Sigma|^{k/2} n\lg (\min\{n,m\}))\) preprocessing time and \(O(|p|\lg\lg|\Sigma|+\min\{|p|,\lg(|\Sigma|^kn)\}\lg \lg(|\Sigma|^kn))\) query time by cutting the dictionary strings and constructing two compressed tries.

Our problem is motivated by haplotype inference from a library of genotypes [14,7]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D as well as gathering statistical information about them can be used to accelerate various haplotype inference algorithms. In particular, algorithms based on the “pure parsimony criteria” [13,16], greedy heuristics such as “Clarks rule” [6,18], EM based algorithms [1,11,12,20,26,30], and algorithms for inferring haplotypes from a set of Trios [4,27].

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Abecasis, G.R., Martin, R., Lewitzky, S.: Estimation of haplotype frequencies from diploid data. American Journal of Human Genetics 69(4 Suppl. 1), 114 (2001)
Google Scholar
Amir, A., Apostolico, A., Landau, G.M., Satta, G.: Efficient text fingerprinting via parikh mapping. Journal of Discrete Algorithms 1(5-6), 409–421 (2003)
Article MATH MathSciNet Google Scholar
Apostolico, A., Iliopoulos, C.S., Landau, G.M., Schieber, B., Vishkin, U.: Parallel construction of a suffix tree with applications. Algorithmica 3, 347–365 (1988)
Article MATH MathSciNet Google Scholar
Brinza, D., He, J., Mao, W., Zelikovsky, A.: Phasing and missing data recovery in family trios. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J.J. (eds.) ICCS 2005. LNCS, vol. 3515, pp. 1011–1019. Springer, Heidelberg (2005)
Google Scholar
Chazelle, B., Guibas, L.J.: Fractional cascading: I. a data structuring technique. Algorithmica 1(2), 133–162 (1986)
Article MATH MathSciNet Google Scholar
Clark, A.G.: Inference of haplotypes from pcr-amplified samples of diploid population. Molecular Biology and Evolution 7(2), 111–122 (1990)
Google Scholar
Cole, R., Gottlieb, L., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proceedings of the 36th annual ACM Symposium on Theory Of Computing (STOC), pp. 91–100. ACM Press, New York (2004)
Google Scholar
Cole, R., Hariharan, R.: Verifying candidate matches in sparse and wildcard matching. In: Proceedings of the 34th annual ACM Symposium on Theory Of Computing (STOC), pp. 592–601. ACM Press, New York (2002)
Google Scholar
Cole, R., Kopelowitz, T., Lewenstein, M.: Suffix trays and suffix trists: structures for faster text indexing. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4051, pp. 358–369. Springer, Heidelberg (2006)
Chapter Google Scholar
Didier, G., Schmidt, T., Stoye, J., Tsur, D.: Character sets of strings. Journal of Discrete Algorithms (to appear)
Google Scholar
Excoffier, L., Slatkin, M.: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution 12(5), 921–927 (1995)
Google Scholar
Fallin, D., Schork, N.J.: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. American Journal of Human Genetics 67(4), 947–959 (2000)
Article Google Scholar
Gusfield, D.: Haplotype inference by pure parsimony. In: Baeza-Yates, R.A., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 144–155. Springer, Heidelberg (2003)
Chapter Google Scholar
Gusfield, D., Orzack, S.H.: Haplotype inference. In: Aluru, S. (ed.) CRC handbook on bioinformatics (2005)
Google Scholar
Hagerup, T., Miltersen, P.B., Pagh, R.: Deterministic dictionaries. J. of Algorithms 41(1), 69–85 (2001)
Article MATH MathSciNet Google Scholar
Hajiaghayi, M.T., Jain, K., Konwar, K., Lau, L.C., Mandoiu, I.I., Vazirani, V.V.: Minimum multicolored subgraph problem in multiplex pcr primer set selection and population haplotyping. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J.J. (eds.) ICCS 2006. LNCS, vol. 3991, pp. 758–766. Springer, Heidelberg (2006)
Chapter Google Scholar
Halldórsson, B.V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S., Istrail, S.: A survey of computational methods for determining haplotypes. In: Istrail, S., Waterman, M.S., Clark, A. (eds.) Computational Methods for SNPs and Haplotype Inference. LNCS (LNBI), vol. 2983, pp. 26–47. Springer, Heidelberg (2004)
Google Scholar
Halperin, E., Karp, R.M.: The minimum-entropy set cover problem. In: Díaz, J., Karhumäki, J., Lepistö, A., Sannella, D. (eds.) ICALP 2004. LNCS, vol. 3142, pp. 733–744. Springer, Heidelberg (2004)
Google Scholar
Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM Journal of Computing 13(2), 338–355 (1984)
Article MATH MathSciNet Google Scholar
Hawley, M.E., Kidd, K.K.: Haplo: A program using the em algorithm to estimate the frequencies of multi-site haplotypes. Journal of Heredity 86, 409–411 (1995)
Google Scholar
Helmuth, L.: Genome research: Map of human genome 3.0. Science 5530(293), 583–585 (2001)
Article Google Scholar
Indyk, P.: Faster algorithms for string matching problems: Matching the convolution bound. In: Proceedings of the 39th annual Symposium on Foundations of Computer Science (FOCS), pp. 166–173 (1998)
Google Scholar
Karp, R.M., Miller, R.E., Rosenberg, A.L.: Rapid identification of repeated patterns in strings, trees and arrays. In: Proceedings of the 4th annual ACM Symposium on Theory Of Computing (STOC), pp. 125–136. ACM Press, New York (1972)
Google Scholar
Kedem, Z.M., Landau, G.M., Palem, K.V.: Parallel suffix-prefix-matching algorithm and applications. SIAM Journal of Computing 25(5), 998–1023 (1996)
Article MATH MathSciNet Google Scholar
Kolpakov, R., Raffinot, M.: New algorithms for text fingerprinting. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 342–353. Springer, Heidelberg (2006)
Chapter Google Scholar
Long, J.C., Williams, R.C., Urbanek, M.: An E-M algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics 56(2), 799–810 (1995)
Google Scholar
Marchini, J., Cutler, D., Patterson, N., Stephens, M., Eskin, E., Halperin, E., Lin, S., Qin, Z.S., Munro, H.M., Abecasis, G., Donnelly, P.: The International HapMap Consortium.A comparison of phasing algorithms for trios and unrelated individuals. American Journal of Human Genetics 78, 437–450 (2006)
Google Scholar
Shi, Q., JáJá, J.: Novel transformation techniques using q-heaps with applications to computational geometry. SIAM Journal of Computing 34(6), 1471–1492 (2005)
Article Google Scholar
van Emde Boas, P.: Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters 6(3), 80–82 (1977)
Article MATH Google Scholar
Zhang, P., Sheng, H., Morabia, A., Gilliam, T.C.: Optimal step length em algorithm (oslem) for the estimation of haplotype frequency and its application in lipoprotein lipase genotyping. BMC Bioinformatics 4(3) (2003)
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Haifa, Haifa, Israel
Gad M. Landau
Department of Computer and Information Science, Polytechnic University, New York, USA
Gad M. Landau
Department of Computer Science, Ben-Gurion University, Beer-Sheva, Israel
Dekel Tsur
Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge MA, USA
Oren Weimann

Authors

Gad M. Landau
View author publications
You can also search for this author in PubMed Google Scholar
Dekel Tsur
View author publications
You can also search for this author in PubMed Google Scholar
Oren Weimann
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Nivio Ziviani Ricardo Baeza-Yates

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Landau, G.M., Tsur, D., Weimann, O. (2007). Indexing a Dictionary for Subset Matching Queries. In: Ziviani, N., Baeza-Yates, R. (eds) String Processing and Information Retrieval. SPIRE 2007. Lecture Notes in Computer Science, vol 4726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75530-2_18

Download citation

DOI: https://doi.org/10.1007/978-3-540-75530-2_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75529-6
Online ISBN: 978-3-540-75530-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics