Transducing Markov sequences

Authors:

Benny Kimelfeld,

Christopher Ré

Journal of the ACM (JACM), Volume 61, Issue 5

Article No.: 32, Pages 1 - 48

https://doi.org/10.1145/2630065

Published: 08 September 2014

Abstract

A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in the order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximate order thereof. In particular, it is shown that enumeration in any subexponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Due to this hardness, a special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.
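To make the query semantics concrete, the following is a minimal sketch of the possible-worlds definition of answer confidence, not the algorithms analyzed in the article. It assumes a toy Markov sequence of length 3 over the symbols `a` and `b`, and it uses a plain Python function (`query`, which deletes every `b`) as a stand-in for a finite-state transducer. All names and probabilities here are hypothetical. The enumeration is exponential in the sequence length and serves only to illustrate what "confidence of an answer" and "decreasing confidence order" mean.

```python
from itertools import product

# Hypothetical toy Markov sequence of length 3 over the symbols "a" and "b".
# initial[x] = Pr(first symbol is x)
initial = {"a": 0.5, "b": 0.5}
# transitions[i][x][y] = Pr(symbol i+1 is y | symbol i is x)
transitions = [
    {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.25, "b": 0.75}},
    {"a": {"a": 1.0, "b": 0.0}, "b": {"a": 0.5, "b": 0.5}},
]


def string_probability(s):
    """Probability that the Markov sequence generates exactly the string s."""
    p = initial[s[0]]
    for i, step in enumerate(transitions):
        p *= step[s[i]][s[i + 1]]
    return p


def query(s):
    """Stand-in for a transducer (an assumption): delete every 'b' from s."""
    return s.replace("b", "")


def answer_confidences():
    """Confidence of each answer: total probability of the strings producing it."""
    conf = {}
    for symbols in product("ab", repeat=len(transitions) + 1):
        s = "".join(symbols)
        p = string_probability(s)
        if p > 0:
            out = query(s)
            conf[out] = conf.get(out, 0.0) + p
    return conf


def ranked_answers():
    """Answers sorted by decreasing confidence (brute force, for illustration)."""
    return sorted(answer_confidences().items(), key=lambda kv: -kv[1])
```

Calling `ranked_answers()[:k]` gives the top-k answers for this toy instance. The article's contribution concerns when such confidences and rankings can be computed without enumerating every possible string.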

Index Terms

Transducing Markov sequences
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

Journal of the ACM Volume 61, Issue 5

August 2014

171 pages

ISSN:0004-5411

EISSN:1557-735X

DOI:10.1145/2668245

Editor:
Victor Vianu
University of California, San Diego

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 September 2014

Accepted: 01 May 2014

Revised: 01 May 2014

Received: 01 September 2010

Published in JACM Volume 61, Issue 5

Qualifiers

Research-article
Research
Refereed
