Survey
Conjunctive and Boolean grammars: The true general case of the context-free grammars☆
Introduction
Syntax of languages, both natural and artificial, is defined inductively, in the sense that syntactic properties of strings of symbols are logically determined by the properties of their substrings. A grammar of a particular language gives names to these syntactic properties, and explains how shorter strings with certain properties can be concatenated to obtain longer strings with another property. For instance, a rule of a hypothetical grammar for a natural language may say that a subject followed by a predicate is a sentence; the form of a subject and a predicate is defined by other rules of the grammar. Similarly, a grammar for a programming language may define a loop statement as a keyword while followed by an expression and a statement (where the latter may, in particular, be another loop statement). An abstract language $\{a^n b^n \mid n \geqslant 0\}$ over an alphabet $\{a, b\}$ is completely defined by saying that a string belongs to this language if and only if it is either an empty sequence of symbols, or a string of the form $awb$, where $w$ is a string belonging to this language.
Definitions of this kind are naturally given whenever the syntax of a language needs to be clearly described, such as in the textbook on the English grammar by Reed and Kellogg [1] or in the first description of the Algol programming language [2]. Following Chomsky’s [3] influential works, such definitions became known as context-free grammars, reflecting the fact that syntactic properties of a substring do not depend on the context in which it occurs. The form of the rules was fixed to $A \to X_1 \ldots X_\ell$, where the symbol $A$ represents a syntactic notion defined in the grammar, such as a sentence or a loop statement (these auxiliary notions are called “nonterminal symbols” for historical reasons), and each symbol $X_i$ may be another nonterminal symbol or a symbol of the target alphabet. Such a rule means that every string representable as a concatenation $w_1 \ldots w_\ell$, in which each $w_i$ has the property $X_i$ (or is the symbol $X_i$ itself, if $X_i$ belongs to the alphabet), therefore has the property $A$. Returning to the above examples, the hypothetical grammar for a natural language has a rule $\mathrm{Sentence} \to \mathrm{Subject}\ \mathrm{Predicate}$, while the grammar for the abstract language consists of two rules, $S \to aSb$ and $S \to \varepsilon$ (where $\varepsilon$ denotes the empty string), and represents a complete inductive definition.
Context-free grammars may be thought of as a logic for inductive descriptions of syntax, in which the propositional connectives available for combining syntactical conditions are restricted to disjunction only. Indeed, having multiple rules for a single nonterminal essentially represents disjunction: two rules $A \to \alpha$ and $A \to \beta$ mean that a string has the property $A$ if and only if it is representable as $\alpha$ or as $\beta$. Any other Boolean operations, such as conjunction and negation, are not expressible using ordinary context-free grammars: that is, there is no general way of specifying all strings that satisfy a condition $A$ and at the same time another condition $B$, and no way to denote the set of strings that do not satisfy a condition $A$. Furthermore, it is well known that the intersection of two context-free languages or the complement of a context-free language need not be context-free [4]; in other words, in this particular logic, conjunction or negation cannot be represented through disjunction only.
This omission in the formalism suggests the idea of defining a more general logic, which would maintain the main principles behind the context-free grammars, but at the same time extend the set of available propositional connectives. The result could then be regarded as a completion of the incomplete standard definition of context-free grammars. Early attempts in this direction were made by Latta and Wall [5] and by Heilbrunner and Schmitz [6], who proposed formalisms for specifying Boolean combinations of context-free languages. Latta and Wall [5], in particular, argued for the relevance of their formalism to linguistics. However, the use of conjunction and negation in these grammars was heavily restricted, and one still could not use them as freely as the disjunction.
An extension of the definition of context-free grammars featuring unrestricted conjunction, introduced by the author [7], is known as a conjunctive grammar. In these grammars, the conjunction of two syntactical conditions can be directly expressed in the form of a rule $A \to \alpha \,\&\, \beta$, which asserts that every string representable both as $\alpha$ and as $\beta$ therefore has the property $A$. The more general Boolean grammars [8] further extend the definition by allowing an explicit negation, that is, every Boolean operation is directly expressible in their formalism. For instance, the set of strings representable as $\alpha$, and at the same time not representable as $\beta$, can be written as a rule $A \to \alpha \,\&\, \neg\beta$. Both types of grammars remain essentially context-free, in the sense that the deduction of the properties of a string does not depend on the context in which it occurs; the properties of a string are defined as a function of the properties of the substrings into which the string can be split. Therefore, as rightly noted by Kountouriotis et al. [9], “conjunctive context-free grammars” and “Boolean context-free grammars” would be appropriate names for these models. Along with most of the literature, this paper assumes the shorter names, yet refers to the fragment of Boolean grammars featuring the disjunction only as the ordinary context-free grammars.
All semiformal interpretations of rules given so far show the intended meaning of grammars, but are not yet definitions in the mathematical sense. Viewing grammars as a logic, the most direct approach for defining their semantics is by introducing a deduction of elementary statements of the form “string $w$ has property $A$”, which are inferred from each other according to the rules of the grammar. This general approach may be regarded as folklore; in particular, it was used by Sikkel [10] to explain computations carried out by parsers. Alternatively, the same logical dependencies can be represented by interpreting a grammar as a system of equations with formal languages as unknowns, as done by Ginsburg and Rice [11]. Finally, the most widespread definition of ordinary context-free grammars, given by Chomsky [3], is by string rewriting, where a rule $A \to \alpha$ is regarded as a production for rewriting a symbol $A$ with a substring $\alpha$, so that an abstract symbol for a sentence is eventually rewritten into an actual sentence. However, it must be stressed that even though the definition by rewriting is indeed the simplest one, it is nothing more than a convenient characterization, which leaves behind the true nature of formal grammars: that is, logical dependence between items of the form “string $w$ has property $A$”. Identifying formal grammars with rewriting systems is a grave error in judgement.
A conjunctive grammar can be defined in the same three ways as an ordinary context-free grammar: by a deduction system [12] with appropriately extended inference rules, by language equations [13] involving the intersection operation, and by rewriting [7], which is augmented to a special kind of term rewriting. These definitions are explained in detail in Section 2 of this survey, along with several characteristic examples of conjunctive grammars, which demonstrate how conjunction can be put to use in the well-known setting of inductive definitions of syntax.
The definition of Boolean grammars is more complicated, because a grammar may express a contradiction of the form $S \to \neg S$, which states that a string has the property $S$ if and only if it does not have this property. Thus, for Boolean grammars, the dependence of items of the form “string $w$ has property $A$” has a more complicated form, which calls for being expressed by equations. The simpler approach to the definition uses a system of language equations, in which the negation is interpreted by complementation, and imposes a certain condition upon this system, which ensures that it has a unique solution; this unique solution then defines the meaning of a grammar. A more general definition was given by Kountouriotis et al. [14], who interpreted a Boolean grammar in terms of three-valued languages, so that a string may belong to a language, not belong to it, or have an undetermined membership status. Both methods are explained in Section 3. There is no known definition of Boolean grammars by rewriting.
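The difficulty caused by negation can be observed on a finite fragment. The sketch below is an illustration under stated assumptions, not the formal definition given in Section 3: the unknown ranges over subsets of the strings of length at most 3 over a one-symbol alphabet, and all helper names are hypothetical. The rule S → ¬S is read as the equation "S equals its own complement", which no set satisfies, whereas the rule S → ¬aS yields an equation with exactly one solution on this fragment.

```python
from itertools import combinations

BOUND = 3
# All strings over {a} of length at most BOUND.
UNIVERSE = frozenset("a" * n for n in range(BOUND + 1))


def subsets(universe):
    """Enumerate every subset of the finite universe."""
    items = sorted(universe)
    for size in range(len(items) + 1):
        for chosen in combinations(items, size):
            yield frozenset(chosen)


def solutions(rhs):
    """All subsets X of the universe satisfying X == rhs(X)."""
    return [x for x in subsets(UNIVERSE) if x == rhs(x)]


def complement(x):
    """Negation interpreted as complementation within the universe."""
    return UNIVERSE - x


def prepend_a(x):
    """Concatenation a.X, cut off at the length bound."""
    return frozenset("a" + w for w in x) & UNIVERSE


# S -> not S: the equation S = complement(S) has no solution.
contradiction = solutions(lambda s: complement(s))

# S -> not aS: the equation S = complement(a.S) has a unique solution.
alternating = solutions(lambda s: complement(prepend_a(s)))

if __name__ == "__main__":
    print(contradiction)
    print([sorted(x) for x in alternating])
```

In the second equation, the membership of each string is determined by the membership of a strictly shorter string, which is why the solution is unique: it consists of the strings of even length.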
This survey aims to present the research on conjunctive and Boolean grammars carried out over the last decade, and to justify the thesis that these grammars are the true general case of the context-free grammars. The crucial points in support of this statement are as follows. On the one hand, conjunctive and Boolean grammars maintain the main inductive principles behind the ordinary context-free grammars, which account for their intuitive clarity and suitability for representing syntax, and only offer additional logical connectives within the same framework; these further expressive means allow giving meaningful descriptions of quite a few syntactic constructs not representable by ordinary context-free grammars. On the other hand, this extra power does not damage the crucial properties of context-free grammars: the intuitive clarity of descriptions is preserved, the upper bounds on time complexity remain the same, and most of the parsing algorithms are directly inherited from the context-free case. In particular, the basic bottom-up parsing algorithms for ordinary context-free grammars of the general form, such as the Cocke–Kasami–Younger algorithm [15], [16] and its variants [17], [18], [19], extend to Boolean grammars so smoothly and obviously that one can hardly see any reason for limiting logical connectives in a grammar to disjunction only. Applying some other algorithms, such as the Lang–Tomita generalized LR [20], [21] and the recursive descent, to Boolean grammars requires elaborating their flow control, but, in general, the elaboration amounts to having a parser compute conjunction and negation wherever these operations occur in the grammar.
Although conjunctive and Boolean grammars inherit many practical properties of ordinary context-free grammars, they have some essential differences in their theoretical properties. One of these differences concerns sublinear-time parallel recognition algorithms operating on a circuit: such algorithms are known for ordinary context-free grammars [22], [23], [24], but most likely do not exist even for conjunctive grammars, since these grammars are capable of representing some P-complete languages. Another difference concerns the decision problems: for an ordinary context-free grammar, one can effectively test whether it generates a non-empty language, but it is undecidable whether a given grammar generates the set of all strings; both problems are undecidable for conjunctive grammars. Yet another difference concerns grammars over a one-symbol alphabet: while ordinary context-free grammars are limited to regular subsets of $a^*$, conjunctive grammars can generate a wide variety of one-symbol languages [25], [26], [27].
The last topic of this survey is summarizing the properties of conjunctive and Boolean grammars, and comparing them with the properties of other important families of formal grammars. Considering all types of grammars together, and understanding conjunctive and Boolean grammars as an essential part of the theory of formal grammars, leads to a new outlook on grammars as such. This new outlook begins with a new classification of meaningful families of formal grammars, done in terms of the amount of ambiguity and nondeterminism, various motivated restrictions on the form of rules (such as linear concatenation), and the set of allowed logical connectives (limited to disjunction alone in ordinary context-free grammars).
The proposed classification of grammars notably ignores the first and still the most well-known classification of families of formal languages: the Chomsky hierarchy. Why is it ignored? Chomsky’s hierarchy comprises the regular languages (“type 3”), the context-free languages (“type 2”), the nondeterministic linear space (“type 1”) and the recursively enumerable sets (“type 0”). These are the families of languages considered in the early days of computer science by Chomsky [3], who had formalized the intuitive notion of a formal grammar using string-rewriting systems, and then attempted to implement further linguistic ideas by altering this definition. This had a surprising outcome: though none of the modifications had anything to do with syntax, all three of them turned out to be important models of computation: “type 0” is a reformulation of a nondeterministic Turing machine, “type 3” reformulates nondeterministic finite automata, and “type 1” became the first computational complexity class ever to be considered. Putting these three models of computability, along with the basic model of syntax, within a single framework had a significant impact on the early development of the theory of computation, and Chomsky’s hierarchy remains a milestone in the history of computer science. However, as far as models of syntax are concerned, this hierarchy did not serve its purpose. Despite decades of subsequent laborious studies, the research in string-rewriting systems centred around context-free rewriting revealed no other viable model of syntax besides the context-free grammar. This leads to a conclusion that the representation of context-free grammars by string rewriting is a unique coincidence, rather than a systematic association between rewriting and grammars. Furthermore, the ungrammatical levels of the Chomsky hierarchy (“type 1” and “type 0”) are useless even as a point of reference, because meaningful syntax has complexity much below $\mathrm{NSPACE}(n)$.
In spite of its historical importance, the Chomsky hierarchy is hardly relevant anymore.
The research on formal grammars carried out over the last fifty years revealed quite a few important families of formal grammars, obtained by restricting ordinary context-free grammars: LR($k$) grammars (and the deterministic context-free languages they generate), linear grammars, unambiguous grammars, etc. These families form the basis of the proposed hierarchy of formal grammars, which is then extended towards the conjunctive and Boolean grammars, as well as to their subfamilies defined by analogy with the subfamilies of ordinary context-free grammars: such as, for instance, linear conjunctive grammars or unambiguous Boolean grammars. In Section 8, all families in the hierarchy are compared in terms of their expressive power, closure properties under basic operations and the decidability and complexity of various properties. The expressive power of different grammars is furthermore related to the computational complexity classes between and .
The survey is concluded with some suggested directions for research on conjunctive and Boolean grammars, and on formal grammars in general. First, there is a list of significant theoretical open problems, with an award of $360 Canadian offered by the author [28] for the first correct solution of each problem. Nine problems were originally stated in 2006; since then, two problems were solved [25], [29], and, at the time of writing, seven remain open. This survey includes the statements of all problems, and briefly comments on possible approaches to them. Furthermore, some general questions worth investigation are suggested, including possible discovery of new variants of formal grammars, as well as implementation and application of conjunctive and Boolean grammars as they are.
Three equivalent definitions
A conjunctive grammar is a quadruple $G = (\Sigma, N, R, S)$, in which:
- $\Sigma$ is the alphabet of the language being defined, that is, a finite set of symbols, from which the strings in the language are built;
- $N$ is a finite set of auxiliary notions used in the grammar, each of which represents a syntactic property that a string in $\Sigma^*$ may or may not have; for historical reasons, they are called nonterminal symbols or nonterminals, even though this name ought to have been deprecated long ago;
- $R$ is a finite set of
Intuitive definition
Boolean grammars are context-free grammars equipped with all propositional connectives, or, in other words, conjunctive grammars augmented with negation. Conversely, conjunctive grammars are the monotone fragment of Boolean grammars.
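A Boolean grammar is a quadruple $G = (\Sigma, N, R, S)$, in which
- $\Sigma$ is the alphabet;
- $N$ is the set of nonterminal symbols;
- $R$ is a finite set of rules of the form $A \to \alpha_1 \,\&\, \ldots \,\&\, \alpha_m \,\&\, \neg\beta_1 \,\&\, \ldots \,\&\, \neg\beta_n$ with $m + n \geqslant 1$ and $\alpha_i, \beta_j \in (\Sigma \cup N)^*$;
- $S \in N$ is the initial symbol.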
The only difference from a
Grammars with linear concatenation
A special case of ordinary context-free grammars, which can express a concatenation of a nonterminal symbol only with terminal strings, is known as a linear context-free grammar. In such grammars, every rule has at most one nonterminal symbol on its right-hand side. These grammars are notable for their lower computational complexity and other noteworthy properties.
Similarly to the case of grammars with disjunction only, a conjunctive grammar is called linear conjunctive, if every rule it contains is either of the form
Basic parsing algorithms
Parsing means decomposing a string into substrings according to a grammar, and verifying that it is a well-formed sentence. Given a string as an input, a parsing algorithm should determine whether the string belongs to the language described by a fixed (or a given) grammar, and if it does, construct a parse tree of the string, as it is defined by the grammar.
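As a sketch of how tabular recognition carries over from the context-free case, the code below adapts the Cocke–Kasami–Younger table to a conjunctive grammar in binary normal form, assuming that every rule is either A → a or A → B1C1 & … & BmCm. The grammar for $\{a^n b^n c^n \mid n \geqslant 1\}$, its nonterminal names and the encoding are assumptions made for this illustration; the function only decides membership and does not construct a parse tree.

```python
# Rules of the form A -> a, as (nonterminal, terminal) pairs.
TERMINAL_RULES = [("A", "a"), ("C", "c"), ("Ta", "a"), ("Tb", "b"), ("Tc", "c")]

# Rules of the form A -> B1 C1 & ... & Bm Cm, as (A, list of (B, C) pairs).
BINARY_RULES = [
    ("S", [("D", "C"), ("A", "E")]),  # S -> DC & AE
    ("A", [("A", "A")]),              # A generates a+
    ("C", [("C", "C")]),              # C generates c+
    ("D", [("Ta", "Tb")]),            # D generates a^n b^n, n >= 1
    ("D", [("Ta", "D1")]),
    ("D1", [("D", "Tb")]),
    ("E", [("Tb", "Tc")]),            # E generates b^n c^n, n >= 1
    ("E", [("Tb", "E1")]),
    ("E1", [("E", "Tc")]),
]


def recognizes(word, start="S"):
    """Decide membership by filling a table indexed by substrings."""
    n = len(word)
    if n == 0:
        return False  # this particular grammar has no rule for the empty string
    table = {}
    for i in range(n):
        table[i, i + 1] = {nt for nt, ch in TERMINAL_RULES if ch == word[i]}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # All pairs (B, C) such that word[i:j] splits into a B-part and a C-part.
            pairs = set()
            for k in range(i + 1, j):
                for left in table[i, k]:
                    for right in table[k, j]:
                        pairs.add((left, right))
            # A rule applies only if every one of its conjuncts is among the pairs.
            table[i, j] = {
                nt for nt, conjuncts in BINARY_RULES
                if all(c in pairs for c in conjuncts)
            }
    return start in table[0, n]


if __name__ == "__main__":
    for w in ["abc", "aabbcc", "aabbc", "abcc"]:
        print(w, recognizes(w))
```

The only change with respect to the ordinary algorithm is that the set of pairs is collected over all splits first, and a rule is then applied when all of its conjuncts are present; different conjuncts of one rule may use different splits of the same substring.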
Advanced approaches to parsing
This section describes several parsing methods that have theoretically superior performance to the basic parsing algorithms discussed above. Even though some of them are quite unlikely to be useful in practice, they are important for understanding the theoretical complexity of formal grammars.
Grammars over a one-symbol alphabet
Conjunctive grammars over a unary alphabet form a special area of study. Though such grammars are completely irrelevant to the main purpose of formal grammars, that of representing syntax, they are theoretically important as a pure case of conjunctive grammars, which already shows some of their distinctive properties. Furthermore, they are crucial in the study of language equations, where their properties form the basis of the study of the more general language equations over a unary alphabet
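A concrete instance of a non-regular one-symbol language can be checked numerically. The sketch below assumes the four-nonterminal grammar usually credited to Jeż, with the rules A1 → A1A3 & A2A2 | a, A2 → A1A1 & A2A6 | aa, A3 → A1A2 & A6A6 | aaa and A6 → A1A2 & A3A3, in which A1 is known to define $\{a^{4^n} \mid n \geqslant 0\}$. The code is an illustration with hypothetical names, not taken from this survey: each unary language is represented by the set of lengths of its strings up to a bound, concatenation becomes addition of lengths, and the equations are iterated from empty sets to their least solution.

```python
def sums(x, y, bound):
    """Concatenation of unary languages, given as sets of string lengths."""
    return {p + q for p in x for q in y if p + q <= bound}


def unary_least_solution(bound):
    """Least solution of the four equations on lengths up to bound (bound >= 3)."""
    langs = {"A1": set(), "A2": set(), "A3": set(), "A6": set()}
    while True:
        a1, a2, a3, a6 = langs["A1"], langs["A2"], langs["A3"], langs["A6"]
        new = {
            # A1 -> A1 A3 & A2 A2 | a
            "A1": (sums(a1, a3, bound) & sums(a2, a2, bound)) | {1},
            # A2 -> A1 A1 & A2 A6 | aa
            "A2": (sums(a1, a1, bound) & sums(a2, a6, bound)) | {2},
            # A3 -> A1 A2 & A6 A6 | aaa
            "A3": (sums(a1, a2, bound) & sums(a6, a6, bound)) | {3},
            # A6 -> A1 A2 & A3 A3
            "A6": sums(a1, a2, bound) & sums(a3, a3, bound),
        }
        if new == langs:
            return langs
        langs = new


if __name__ == "__main__":
    solution = unary_least_solution(300)
    for name in ("A1", "A2", "A3", "A6"):
        print(name, sorted(solution[name]))
```

Each nonterminal A_i collects the lengths of the form i · 4^n; the intersection in every rule is what filters out the other sums, which disjunction alone could not do.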
Hierarchy of language families
In order to compare the expressive power of meaningful models of syntax, one should begin with compiling a list of such models. The main point of reference is, of course, the ordinary context-free grammars (CF). Many important families of languages are defined by restricting context-free grammars in one way or another. Prohibiting syntactic ambiguity leads to the unambiguous context-free grammars (UnambCF), and to their special cases: the LR($k$) context-free grammars, which define the
Nine theoretical problems
The previous survey of Boolean grammars [28] introduced nine open problems, each concerned with some theoretical property of Boolean grammars. Since then, two problems have been solved, and seven others remain open. An award is offered for the first correct solution of each of the remaining problems.
Acknowledgements
The author was supported by the Academy of Finland under grants 134860 and 257857.
References (145)
- On certain formal properties of grammars. Information and Control (1959)
- Note on the Boolean properties of context free languages. Information and Control (1960)
- An efficient recognizer for the Boolean closure of context-free languages. Theoretical Computer Science (1991)
- Boolean grammars. Information and Computation (2004)
- A game-theoretic characterization of Boolean grammars. Theoretical Computer Science (2011)
- The dual of concatenation. Theoretical Computer Science (2005)
- Well-founded semantics for Boolean grammars. Information and Computation (2009)
- Recognition and parsing of context-free languages in time $n^3$. Information and Control (1967)
- On uniform circuit complexity. Journal of Computer and System Sciences (1981)
- A grammatical characterization of alternating pushdown automata. Theoretical Computer Science (1989)
- A characterization of exponential-time languages by alternating context-free grammars. Theoretical Computer Science
- Alternating context-free languages and linear time $\mu$-calculus with sequential composition. Electronic Notes in Theoretical Computer Science
- Unambiguous Boolean grammars. Information and Computation
- Conjunctive grammars with restricted disjunction. Theoretical Computer Science
- Decision problems for language equations. Journal of Computer and System Sciences
- Locally stratified Boolean grammars. Information and Computation
- One-way bounded cellular automata. Information and Control
- Systolic trellis automata: stability, decidability and complexity. Information and Control
- Characterizations and computational complexity of systolic trellis automata. Theoretical Computer Science
- Variations of the firing squad problem and applications. Information Processing Letters
- An 8-state minimal time solution to the firing squad synchronization problem. Information and Control
- On real-time one-way cellular array. Theoretical Computer Science
- A property of real-time trellis automata. Discrete Applied Mathematics
- A recognition and parsing algorithm for arbitrary conjunctive grammars. Theoretical Computer Science
- On the translation of languages from left to right. Information and Control
- Scannerless Boolean parsing, LDTA 2006. Electronic Notes in Theoretical Computer Science
- Expressive power of Boolean grammars. Theoretical Computer Science
- Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation
- Fast on-line integer multiplication. Journal of Computer and System Sciences
- Parsing Boolean grammars over a one-letter alphabet using online convolution. Theoretical Computer Science
- The hardest linear conjunctive language. Information Processing Letters
- A simple P-complete problem and its language-theoretic representations. Theoretical Computer Science
- Commutative grammars: the complexity of uniform word problems. Information and Control
- On the number of nonterminals in linear conjunctive grammars. Theoretical Computer Science
- On the expressive power of univariate equations over sets of natural numbers. Information and Computation
- On the closure properties of linear conjunctive languages. Theoretical Computer Science
- Higher Lessons in English
- Preliminary report: international algebraic language. Communications of the ACM
- Conjunctive grammars. Journal of Automata, Languages and Combinatorics
- Parsing Schemata
- Two families of languages related to ALGOL. Journal of the ACM
- Conjunctive grammars and systems of language equations. Programming and Computer Software
- An improved context-free recognizer. ACM Transactions on Programming Languages and Systems
- A syntax-analysis procedure for unambiguous context-free grammars. Journal of the ACM
- General context-free recognition in less than cubic time. Journal of Computer and System Sciences
- Deterministic techniques for efficient non-deterministic parsers
- An efficient augmented context-free parsing algorithm. Computational Linguistics
- A parallel algorithm for context-free parsing. Australian Computer Science Communications
☆ This paper supersedes the earlier surveys, “An overview of conjunctive grammars” (Bulletin of the EATCS, 2004) and “Nine open problems for conjunctive and Boolean grammars” (Bulletin of the EATCS, 2007).