Context Free Grammars

Most modern programming languages specify their grammar using a context free grammar (CFG). A CFG consists of a finite set of syntactic variables, a special variable known as the start symbol, a set of terminals, and one or more production rules for each variable.

A grammar rule consists of three parts. The first part is a variable on the left hand side. In the middle is the production symbol, ::=, and on the right hand side is a list of productions. A production is a sequence of variables and terminals. Note that a single variable may have more than one production associated with it.

Grammars are so important that most language specifications use a special shorthand for describing context free grammars named EBNF. EBNF stands for Extended Backus-Naur Form. Backus and Naur were two of the key developers of the Algol 60 programming language and were pioneers in the use of CFGs to specify programming languages.

In EBNF, the start symbol is the first rule presented. All of the productions associated with a given variable are separated with the '|' character.

Using a grammar to describe a language gives us a way to describe exactly what constitutes a legal program. More importantly, the grammatical structure of a language gives us clues we need to interpret what a program means.

Pushdown Automata

The machine that is equivalent to a CFG is a pushdown automaton. A pushdown automaton consists of a finite state control with an unbounded stack. In this course, we will not have the time to formally define this machine. Suffice it to say, there is a 7-tuple that formally defines it.

Pushdown automaton

A pushdown automaton differs from a finite state machine in that the symbol on top of the stack can affect which transition occurs on an input symbol, and the machine can manipulate the stack.

When a pushdown automaton starts, it pushes a symbol that represents the goal of matching the start production. As input is consumed, it can either push a symbol indicating that a new subgoal needs to be matched, or it may pop the symbol on the top of the stack, indicating that the pattern was matched. A pushdown automaton typically uses an empty stack as the final or accepting condition.

The action of pushing a new subgoal is known as a shift operation. The action of popping a subgoal is known as a reduce operation. A machine built from an ambiguous grammar may exhibit errors, usually called conflicts. These conflicts fall into two categories, shift/reduce and reduce/reduce. A shift/reduce conflict occurs when the machine could either push a new subgoal or recognize a completed subgoal. A reduce/reduce conflict occurs when the machine cannot distinguish which of two patterns has just been matched.

One way a parser may resolve ambiguities is by looking ahead in the input. Many parsers use one token of lookahead to help decide what action should occur next.

The language of a PDA is the set of all input strings that leave the machine in a final condition.
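As a rough illustration of the stack discipline described above, here is a minimal sketch of a stack machine that accepts by empty stack. It assumes a small hypothetical grammar for balanced parentheses, S ::= '(' S ')' S | ε, and uses one symbol of lookahead to pick a production. The function name `accepts` and the grammar are not from these notes, and this top-down style is only one of several ways such a machine can be organized.

```python
def accepts(text: str) -> bool:
    """Stack machine for the hypothetical grammar S ::= '(' S ')' S | ε."""
    stack = ["S"]  # start by pushing the goal of matching the start symbol
    pos = 0        # index of the next unread input symbol
    while stack:
        top = stack.pop()
        lookahead = text[pos] if pos < len(text) else None
        if top == "S":
            if lookahead == "(":
                # Choose S ::= '(' S ')' S and push its symbols as subgoals,
                # reversed so that '(' ends up on top of the stack.
                stack.extend(reversed(["(", "S", ")", "S"]))
            # Otherwise choose S ::= ε: the goal is met, nothing is pushed.
        else:
            # A terminal on the stack must match the next input symbol.
            if lookahead != top:
                return False
            pos += 1
    # Accept only if the stack emptied exactly when the input ran out.
    return pos == len(text)


print(accepts("(()())"))  # True
print(accepts("(()"))     # False
```

The language of this machine is then the set of strings for which `accepts` returns True.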

Formal Definition of a CFG

A CFG is defined by a four-tuple.

G = (V, T, P, S)

  • V is the set of syntactic variables
  • T is the set of terminals
  • P is the set of productions
  • S is the start symbol
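The four-tuple can be written down directly as a data structure. The sketch below is one possible encoding, not a standard one: the class name `Grammar`, its field names, and the `is_well_formed` check are all assumptions made for illustration. It encodes the single-variable expression grammar Expr ::= <number> | Expr '+' Expr | Expr '*' Expr.

```python
from dataclasses import dataclass


@dataclass
class Grammar:
    variables: frozenset  # V
    terminals: frozenset  # T
    productions: dict     # P: variable -> list of productions (tuples of symbols)
    start: str            # S

    def is_well_formed(self) -> bool:
        """Check that S is in V, V and T are disjoint, and P only uses known symbols."""
        if self.start not in self.variables:
            return False
        if self.variables & self.terminals:
            return False
        symbols = self.variables | self.terminals
        for head, bodies in self.productions.items():
            if head not in self.variables:
                return False
            for body in bodies:
                if any(symbol not in symbols for symbol in body):
                    return False
        return True


expr_grammar = Grammar(
    variables=frozenset({"Expr"}),
    terminals=frozenset({"<number>", "+", "*"}),
    productions={
        "Expr": [("<number>",), ("Expr", "+", "Expr"), ("Expr", "*", "Expr")],
    },
    start="Expr",
)

print(expr_grammar.is_well_formed())  # True
```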

Parse Trees

Given a grammar, it is possible to show the grammatical structure of a program as a tree. This process of finding the productions that produce a given program is called parsing.

Rules for building a parse tree:

  • The root of the parse tree is the start symbol.
  • Each interior node is a variable in V.
  • Each leaf node is either a terminal or ε. If it is ε, then it must be an only child.
  • If an interior node is labeled with the variable A and has children X1, X2, ..., Xn, then there must be a production A ::= X1 X2 ... Xn
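The rules above can be checked mechanically. The following sketch assumes a hypothetical encoding in which a tree node is a pair (label, children) and an empty production would be listed as ("ε",). The names `PRODUCTIONS`, `START`, `is_valid_node`, and `is_parse_tree` are made up for this example.

```python
# Productions of the example grammar, keyed by variable.
PRODUCTIONS = {
    "Expr": [("<number>",), ("Expr", "+", "Expr"), ("Expr", "*", "Expr")],
}
START = "Expr"


def is_valid_node(node) -> bool:
    label, children = node
    if not children:
        # A leaf must be a terminal (or ε), never a variable.
        return label not in PRODUCTIONS
    if label not in PRODUCTIONS:
        # An interior node must be labeled with a variable.
        return False
    # The children, read left to right, must spell out one production.
    child_labels = tuple(child[0] for child in children)
    if child_labels not in PRODUCTIONS[label]:
        return False
    return all(is_valid_node(child) for child in children)


def is_parse_tree(root) -> bool:
    # The root must be the start symbol.
    return root[0] == START and is_valid_node(root)


number = ("Expr", [("<number>", [])])
good_tree = ("Expr", [number, ("+", []), number])
bad_tree = ("Expr", [("<number>", []), ("+", [])])

print(is_parse_tree(good_tree))  # True
print(is_parse_tree(bad_tree))   # False
```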

Ambiguous Parse Trees

Consider the following grammar.

Expr ::= <number> | Expr '+' Expr | Expr '*' Expr

A grammar is termed ambiguous if there exist two different parse trees for the same input string. Demonstrate that the above grammar is ambiguous using the string 2 + 3 * 4.

Note that ambiguity is by itself not necessarily a problem. The string 2 + 3 + 4 has an ambiguous parse, but because + is associative, the normal interpretation of either parse tree would produce the same result.
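One way to see both points is to enumerate every parse tree by brute force and evaluate each one. The sketch below assumes the input has already been split into a list of integers and operator strings. The helper names `parses` and `evaluate` are hypothetical, and the brute-force search is only meant for tiny inputs.

```python
def parses(tokens):
    """Yield every parse tree of tokens under
    Expr ::= <number> | Expr '+' Expr | Expr '*' Expr."""
    if len(tokens) == 1 and isinstance(tokens[0], int):
        yield tokens[0]  # Expr ::= <number>
        return
    # Try every operator as the root of the tree.
    for i, token in enumerate(tokens):
        if token in ("+", "*"):
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    yield (left, token, right)


def evaluate(tree):
    """Evaluate a tree with the usual meaning of + and *."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b


print([evaluate(t) for t in parses([2, "+", 3, "*", 4])])  # [14, 20]
print([evaluate(t) for t in parses([2, "+", 3, "+", 4])])  # [9, 9]
```

Both strings have two parse trees, but only in the first case do the trees disagree about the value.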

When ambiguity is an issue, one way of resolving the ambiguity is by structuring the grammar to group operators of a common precedence together in the same rule. For example:

Expr ::= MulOp | Expr "+" MulOp
MulOp ::= Value | MulOp "*" Value
Value ::= <number> | "(" Expr ")"
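A hand-written parser for this grammar might look like the sketch below, with one function per rule. Two assumptions are worth calling out: the left-recursive rules (Expr and MulOp) are implemented as loops, which is a common recursive descent workaround rather than a literal transcription of the grammar, and the tokenizer silently skips any character it does not recognize. All function names are hypothetical.

```python
import re


def tokenize(text):
    # Numbers, operators, and parentheses; anything else is skipped.
    return re.findall(r"\d+|[+*()]", text)


def parse_expr(tokens, pos=0):
    # Expr ::= MulOp | Expr "+" MulOp, written as MulOp ("+" MulOp)*
    value, pos = parse_mulop(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_mulop(tokens, pos + 1)
        value += right
    return value, pos


def parse_mulop(tokens, pos):
    # MulOp ::= Value | MulOp "*" Value, written as Value ("*" Value)*
    value, pos = parse_value(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_value(tokens, pos + 1)
        value *= right
    return value, pos


def parse_value(tokens, pos):
    # Value ::= <number> | "(" Expr ")"
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token.isdigit():
        return int(token), pos + 1
    if token == "(":
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        return value, pos + 1
    raise SyntaxError(f"unexpected token {token!r}")


def calc(text):
    tokens = tokenize(text)
    value, pos = parse_expr(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return value


print(calc("2 + 3 * 4"))    # 14
print(calc("(2 + 3) * 4"))  # 20
```

Because "*" is handled one level deeper than "+", multiplication binds tighter without any extra disambiguation rule.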

Parsing Expression Grammars aka PEGs

A PEG, parsing expression grammar, is a formal grammar. It looks just like a context free grammar, but with some differences. Unlike a CFG, a PEG can never be ambiguous. If there is a valid parse tree, it will be the only one.

PEGs have recently become more popular for a number of reasons. One is that there is a very simple rule for resolving ambiguity: when several alternatives could match, the first one listed wins. The more traditional CFG based tools like yacc and bison can be difficult to work with when parsing ambiguities arise. Another is that a modern parser built from a PEG, such as a packrat parser that memoizes intermediate results, runs in time linear in the length of the input. When PEGs were first explored, they had a combinatorial complexity that rendered them useless for complex programs.
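To make the ordered choice and the role of memoization concrete, here is a minimal recognizer sketch for a small hypothetical PEG (the grammar in the docstring is an assumption, written with the conventional `<-` and `/` PEG notation, and it does not handle whitespace). Each rule is a function from an input position to the position after the match, or None on failure. Python's `functools.lru_cache` stands in for the memo table a packrat parser would keep.

```python
from functools import lru_cache


def peg_match(text: str) -> bool:
    """Recognize text with the hypothetical PEG:

        Expr  <- MulOp '+' Expr / MulOp
        MulOp <- Value '*' MulOp / Value
        Value <- [0-9]+ / '(' Expr ')'
    """

    def lit(ch, pos):
        # Match a single literal character; a failed position stays failed.
        if pos is not None and pos < len(text) and text[pos] == ch:
            return pos + 1
        return None

    @lru_cache(maxsize=None)
    def expr(pos):
        # First alternative: MulOp '+' Expr
        end = lit("+", mulop(pos))
        if end is not None:
            end = expr(end)
            if end is not None:
                return end
        # Second alternative, tried only if the first failed.
        # mulop(pos) was already computed above, so the cache answers it.
        return mulop(pos)

    @lru_cache(maxsize=None)
    def mulop(pos):
        end = lit("*", value(pos))
        if end is not None:
            end = mulop(end)
            if end is not None:
                return end
        return value(pos)

    @lru_cache(maxsize=None)
    def value(pos):
        end = pos
        while end < len(text) and text[end] in "0123456789":
            end += 1
        if end > pos:
            return end  # first alternative: [0-9]+
        end = lit("(", pos)
        if end is not None:
            end = expr(end)
        return lit(")", end)

    return expr(0) == len(text)


print(peg_match("2+3*4"))    # True
print(peg_match("(2+3)*4"))  # True
print(peg_match("2+"))       # False
```

Since each rule is evaluated at most once per input position, the total work grows linearly with the input instead of repeating the same sub-parse on every backtrack.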