Abstract
This paper studies the problem of learning queries from tree-structured data. Tree-structured data is modeled as a node-labeled tree \(T\), and applying a query \(q\) to \(T\) returns a set \(q(T)\) of nodes of \(T\). For a tree-node pair \((T,t)\), where \(t\) is a node of \(T\), \(q\) is said to accept the pair if \(t\in q(T)\) and to reject it if \(t\notin q(T)\). Given a query class \(\mathcal{L}\) and two sets of tree-node pairs \(E_p\) and \(E_n\), the tree query learning problem is to find a query \(q\in \mathcal{L}\) such that (1) \(q\) rejects all pairs in \(E_n\), and (2) the number of pairs in \(E_p\) accepted by \(q\) is maximized. This paper studies the hardness of the tree query learning problem for four query classes: \(\mathcal Q ^{\tiny /}\), \(\mathcal Q ^{\tiny /,*}\), \(\mathcal Q ^{\tiny /,//}\) and \(\mathcal Q ^{\tiny /,[]}\). For \(\mathcal Q ^{\tiny /}\), a PTime algorithm is given. For \(\mathcal Q ^{\tiny /,*}\) and \(\mathcal Q ^{\tiny /,//}\), the problem is shown to be NP-complete. For \(\mathcal Q ^{\tiny /,[]}\), the problem is shown to be NP-hard by considering two constrained fragments of \(\mathcal Q ^{\tiny /,[]}\). In addition, for \(\mathcal Q ^{\tiny /,*}\), \(\mathcal Q ^{\tiny /,[]}\) and \(\mathcal Q ^{\tiny /,//}\), it is shown that no \(n^{1-\epsilon }\)-approximation algorithm exists for any \(\epsilon >0\).
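To make the problem statement concrete, the following is a minimal sketch of the learning objective for the child-axis-only class \(\mathcal Q ^{\tiny /}\). It is not the paper's algorithm. It assumes that a query in \(\mathcal Q ^{\tiny /}\) is a root-anchored sequence of labels such as `a/b/c` and that it selects exactly the nodes whose root-to-node label sequence equals the query. Under that assumption, the only query that can accept a pair \((T,t)\) is the label path of \(t\), so it suffices to try the label paths of the positive examples. The names `Node`, `label_path`, `accepts` and `learn_child_only` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """A node of a node-labeled tree (hypothetical helper type)."""
    label: str
    children: list = field(default_factory=list)
    parent: "Node | None" = None

    def add(self, label):
        """Attach a new child with the given label and return it."""
        child = Node(label, parent=self)
        self.children.append(child)
        return child


def label_path(t):
    """Return the labels from the root down to node t as a tuple."""
    path = []
    while t is not None:
        path.append(t.label)
        t = t.parent
    return tuple(reversed(path))


def accepts(query, pair):
    """Assumed semantics: q accepts (T, t) iff t's label path equals q."""
    _tree, t = pair
    return label_path(t) == query


def learn_child_only(E_p, E_n):
    """Return (query, count): a query rejecting every pair in E_n that
    accepts as many pairs of E_p as possible, or (None, 0) if none exists."""
    forbidden = {label_path(t) for _, t in E_n}
    counts = Counter(label_path(t) for _, t in E_p)
    consistent = {q: c for q, c in counts.items() if q not in forbidden}
    if not consistent:
        return None, 0
    best = max(consistent, key=consistent.get)
    return best, consistent[best]


# Usage: two positive pairs share the path a/b/c, one has a/d;
# the negative pair rules out a/d.
r1 = Node("a"); c1 = r1.add("b").add("c")
r2 = Node("a"); c2 = r2.add("b").add("c")
r3 = Node("a"); d3 = r3.add("d")
r4 = Node("a"); d4 = r4.add("d")
query, covered = learn_child_only([(r1, c1), (r2, c2), (r3, d3)], [(r4, d4)])
```

Under the stated assumption the candidate set is no larger than \(|E_p|\), which is why this restricted case stays polynomial. Wildcards, descendant steps, or branching predicates enlarge the space of queries that can accept a given pair, and this simple enumeration no longer applies.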





Acknowledgments
The work in this paper was partially supported by the National Basic Research (973) Program of China under Grant No. 2012CB316202, the National Natural Science Foundation of China under Grant No. 61003046 and No. 6111113089. We would like to thank Dongjing Miao of Harbin Institute of Technology for his valuable discussions.
Cite this article
Liu, X., Li, J. On the hardness of learning queries from tree structured data. J Comb Optim 29, 670–684 (2015). https://doi.org/10.1007/s10878-013-9609-9
DOI: https://doi.org/10.1007/s10878-013-9609-9