Transducing Markov sequences

Published: 08 September 2014

Abstract

A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in the order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximate order thereof. In particular, it is shown that enumeration in any subexponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Due to this hardness, a special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.
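
To make the notion of confidence concrete, the following is a minimal sketch under simplifying assumptions, not the algorithm of the article. It replaces the transducer with a deterministic finite automaton that only accepts or rejects, and computes the probability that a string drawn from a Markov sequence is accepted, using forward dynamic programming over pairs of (last symbol, automaton state). The function name `acceptance_probability`, the dictionary-based encoding of the sequence, and the example numbers are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def acceptance_probability(initial, transitions, dfa_start, dfa_delta, dfa_accepting):
    """Probability that a string drawn from a Markov sequence is accepted by a DFA.

    initial:       dict symbol -> probability of the first symbol
    transitions:   list of dicts, one per step; transitions[i][a][b] is the
                   probability that the next symbol is b given the current one is a
    dfa_delta:     dict (state, symbol) -> state; a missing entry means "reject"
    dfa_accepting: set of accepting DFA states
    """
    # mass[(symbol, q)] = probability of all prefixes that end in `symbol`
    # and leave the DFA in state q.
    mass = defaultdict(float)
    for sym, p in initial.items():
        q = dfa_delta.get((dfa_start, sym))
        if q is not None and p > 0:
            mass[(sym, q)] += p

    # Extend the prefixes one position at a time.
    for step in transitions:
        nxt = defaultdict(float)
        for (sym, q), p in mass.items():
            for sym2, t in step.get(sym, {}).items():
                q2 = dfa_delta.get((q, sym2))
                if q2 is not None and t > 0:
                    nxt[(sym2, q2)] += p * t
        mass = nxt

    return sum(p for (_, q), p in mass.items() if q in dfa_accepting)


# Hypothetical example: strings of length 3 over {a, b}.
example_initial = {"a": 0.6, "b": 0.4}
example_step = {"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.2, "b": 0.8}}
example_transitions = [example_step, example_step]

# DFA accepting strings that contain at least one 'b'.
example_delta = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}

example_prob = acceptance_probability(
    example_initial, example_transitions, 0, example_delta, {1}
)
# The only rejected string is "aaa", so the result is 1 - 0.6 * 0.7 * 0.7 = 0.706
# (up to floating-point rounding).
```

The running time of this sketch is polynomial in the sequence length, the alphabet size, and the number of automaton states. It does not address the harder setting of the abstract, where a transducer may be nondeterministic and the answers must be ranked by confidence.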

Published In

Journal of the ACM, Volume 61, Issue 5
August 2014
171 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/2668245
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 September 2014
Accepted: 01 May 2014
Revised: 01 May 2014
Received: 01 September 2010
Published in JACM Volume 61, Issue 5

Author Tags

  1. Markov sequences
  2. enumeration
  3. hidden Markov models
  4. probabilistic databases
  5. ranked query evaluation
  6. transducers

Qualifiers

  • Research-article
  • Research
  • Refereed
