4.2 Context-Free Grammars


The formal Definition of a Context-Free Grammar

  1. Terminals are the basic symbols from which strings are formed. The term “token name” is a synonym for “terminal” and frequently we will use the word “token” for terminal when it is clear that we are talking about just the token name. We assume that the terminals are the first components of the tokens output by the lexical analyzer. Examples of terminals are the keywords if and else and the symbols “(“, and “)“.
  2. Nonterminals are syntactic variables that denote sets of strings. Examples are stmt and expr. The sets of strings denoted by nonterminals help define the language generated by the grammar. Nonterminals impose a hierarchical structure on the language that is key to syntax analysis and translation.
  3. In a grammar, one nonterminal is distinguished as the start symbols, and the set of string it denotes is the language generated by the grammar. Conventionally, the productions for the start symbol are listed first.
    1. A nonterminal called the head or left side of the production; this production defines some of the strings denoted by the head.
    2. The symbol ->. Sometimes ::= has been used in place of the arrow.
    3. A body or right side consisting of zero or more terminals and nonterminals. The components of the body describe one way in which strings of the nonterminal at the head can be constructed.

Notational Conventions

  1. These symbols are terminals:
    1. Lowercase letters early in the alphabet, such as a, b, c
    2. Operator symbols such as +, *, and so on
    3. Punctuation symbols such as parentheses, comma, and so on.
    4. The digits 0, 1, …, 9
    5. Boldface strings such as id or if, each of which represents a single terminal symbol.
  2. These symbols are nonterminals:
    1. Uppercase letters early in the alphabet, such as A, B, C
    2. The letter S, which, when it appears, is usually the start symbol.
    3. Lowercase, italic names such as expr or stmt.
    4. When discussing programming constructs, uppercase letters may be used to represent nonterminals for the constructs. For example, nonterminals for expressions, terms, and factors are often represented by E, T, and F, respectively.
    5. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either nonterminals or terminals.
    6. Lowercase letters late in the alphabet, chiefly u, v, …, z, represent (possibly empty) strings of terminals.
  3. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols. Thus, a generic production can be written as A -> α, where A is the head and α is the body.
  4. A set of productions A -> α1, A -> α2, …, A -> αK with a common head A (call them A-productions), may be written as A -> α1 | α2 | … | αK . Call α1, α2, …, αK the alternatives for A.
  5. Unless stated otherwise, the head of the first production is the start symbol.