On the hardness of learning queries from tree structured data

Journal of Combinatorial Optimization

Abstract

This paper studies the problem of learning queries from tree-structured data. Tree-structured data is modeled as a node-labeled tree \(T\), and applying a query \(q\) to \(T\) returns a set \(q(T)\) of nodes of \(T\). For a tree-node pair \((T,t)\), where \(t\) is a node of \(T\), \(q\) is said to accept the pair if \(t\in q(T)\) and to reject it if \(t\notin q(T)\). For a query class \(\mathcal{L}\) and given sets \(E_p\) and \(E_n\) of tree-node pairs, the tree query learning problem is to find a query \(q\in \mathcal{L}\) such that (1) \(q\) rejects all pairs in \(E_n\), and (2) the number of pairs in \(E_p\) accepted by \(q\) is maximized. The paper studies the hardness of this problem for four query classes: \(\mathcal{Q}^{/}\), \(\mathcal{Q}^{/,*}\), \(\mathcal{Q}^{/,//}\) and \(\mathcal{Q}^{/,[]}\). For \(\mathcal{Q}^{/}\), a PTime algorithm is given. For \(\mathcal{Q}^{/,*}\) and \(\mathcal{Q}^{/,//}\), NP-completeness results are shown. For \(\mathcal{Q}^{/,[]}\), the problem is shown to be NP-hard by considering two constrained fragments of \(\mathcal{Q}^{/,[]}\). In addition, for \(\mathcal{Q}^{/,*}\), \(\mathcal{Q}^{/,[]}\) and \(\mathcal{Q}^{/,//}\), it is shown that no \(n^{1-\epsilon}\)-approximation algorithm exists for any \(\epsilon >0\).
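
To make the problem statement concrete, the Python sketch below illustrates one way the child-only case could be solved in polynomial time. It rests on an assumption that is not taken from the paper: a query in \(\mathcal{Q}^{/}\) is read as a root-anchored label sequence such as `/a/b`, and it accepts \((T,t)\) exactly when the labels on the path from the root of \(T\) to \(t\) spell out that sequence. Under that reading, each example pair determines a single label path, so it suffices to count label paths among the positive examples and discard those that also occur among the negative ones. The names `Node`, `label_path` and `learn_child_only_query` are hypothetical, and the paper's own PTime algorithm may differ.

```python
from collections import Counter


class Node:
    """A node of a node-labeled tree; the parent link identifies the tree T."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []

    def add(self, label):
        """Create a child with the given label and return it."""
        child = Node(label, parent=self)
        self.children.append(child)
        return child


def label_path(t):
    """Return the labels from the root of t's tree down to t, as a tuple."""
    labels = []
    while t is not None:
        labels.append(t.label)
        t = t.parent
    return tuple(reversed(labels))


def learn_child_only_query(E_p, E_n):
    """Pick a label path that rejects all of E_n and accepts the most of E_p.

    E_p and E_n are lists of nodes; each node t stands for the pair (T, t),
    with T recovered through the parent links. ASSUMPTION: a query accepts
    (T, t) iff its label sequence equals label_path(t).
    Returns (path, number_of_accepted_positives), or (None, 0) if every
    positive label path is also used by some negative example.
    """
    forbidden = {label_path(t) for t in E_n}
    counts = Counter(label_path(t) for t in E_p)
    allowed = {p: c for p, c in counts.items() if p not in forbidden}
    if not allowed:
        return None, 0
    best = max(allowed, key=allowed.get)
    return best, allowed[best]


def to_query_string(path):
    """Render a label path in a slash-separated form, e.g. ('a', 'b') -> '/a/b'."""
    return "/" + "/".join(path)
```

Each example is visited once and its label path has length at most the depth of its tree, so the running time of this sketch is polynomial in the total input size. The same counting argument does not carry over directly to the classes with wildcards, descendant axes or branching, since a single query there can accept pairs with different label paths.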

Acknowledgments

The work in this paper was partially supported by the National Basic Research (973) Program of China under Grant No. 2012CB316202, the National Natural Science Foundation of China under Grant No. 61003046 and No. 6111113089. We would like to thank Dongjing Miao of Harbin Institute of Technology for his valuable discussions.

Author information

Correspondence to Xianmin Liu.

About this article

Cite this article

Liu, X., Li, J. On the hardness of learning queries from tree structured data. J Comb Optim 29, 670–684 (2015). https://doi.org/10.1007/s10878-013-9609-9

  • DOI: https://doi.org/10.1007/s10878-013-9609-9
