4.2 Context-Free Grammars

 

The Formal Definition of a Context-Free Grammar

  1. Terminals are the basic symbols from which strings are formed. The term “token name” is a synonym for “terminal” and frequently we will use the word “token” for terminal when it is clear that we are talking about just the token name. We assume that the terminals are the first components of the tokens output by the lexical analyzer. Examples of terminals are the keywords if and else and the symbols "(" and ")".
  2. Nonterminals are syntactic variables that denote sets of strings. Examples are stmt and expr. The sets of strings denoted by nonterminals help define the language generated by the grammar. Nonterminals impose a hierarchical structure on the language that is key to syntax analysis and translation.
  3. In a grammar, one nonterminal is distinguished as the start symbol, and the set of strings it denotes is the language generated by the grammar. Conventionally, the productions for the start symbol are listed first.
  4. The productions of a grammar specify the manner in which the terminals and nonterminals can be combined to form strings. Each production consists of the following (a small illustrative sketch follows this list):
    1. A nonterminal called the head or left side of the production; this production defines some of the strings denoted by the head.
    2. The symbol ->. Sometimes ::= has been used in place of the arrow.
    3. A body or right side consisting of zero or more terminals and nonterminals. The components of the body describe one way in which strings of the nonterminal at the head can be constructed.
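
To make these components concrete, here is a small sketch (an illustration, not taken from the text) of a fragment of an expression grammar written as plain Python data; the names terminals, nonterminals, start_symbol, and productions are only illustrative:

    # A minimal sketch: the four components of a context-free grammar for a
    # small expression language, written as plain Python data.
    terminals    = {"id", "+", "*", "(", ")"}
    nonterminals = {"expr", "term", "factor"}
    start_symbol = "expr"

    # Each production pairs a head (a nonterminal) with one body, a sequence
    # of terminals and nonterminals.
    productions = [
        ("expr",   ["expr", "+", "term"]),
        ("expr",   ["term"]),
        ("term",   ["term", "*", "factor"]),
        ("term",   ["factor"]),
        ("factor", ["(", "expr", ")"]),
        ("factor", ["id"]),
    ]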

Notational Conventions

  1. These symbols are terminals:
    1. Lowercase letters early in the alphabet, such as a, b, c
    2. Operator symbols such as +, *, and so on
    3. Punctuation symbols such as parentheses, comma, and so on.
    4. The digits 0, 1, …, 9
    5. Boldface strings such as id or if, each of which represents a single terminal symbol.
  2. These symbols are nonterminals:
    1. Uppercase letters early in the alphabet, such as A, B, C
    2. The letter S, which, when it appears, is usually the start symbol.
    3. Lowercase, italic names such as expr or stmt.
    4. When discussing programming constructs, uppercase letters may be used to represent nonterminals for the constructs. For example, nonterminals for expressions, terms, and factors are often represented by E, T, and F, respectively.
    5. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either nonterminals or terminals.
    6. Lowercase letters late in the alphabet, chiefly u, v, …, z, represent (possibly empty) strings of terminals.
  3. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols. Thus, a generic production can be written as A -> α, where A is the head and α is the body.
  4. A set of productions A -> α1, A -> α2, …, A -> αk with a common head A (call them A-productions) may be written as A -> α1 | α2 | … | αk. Call α1, α2, …, αk the alternatives for A.
  5. Unless stated otherwise, the head of the first production is the start symbol.

Summary of Chapter 4

SYNTAX ANALYSIS

Parsers. A parser takes as input tokens from the lexical analyzer and treats the token names as terminal symbols of a context-free grammar. The parser then constructs a parse tree for its input sequence of tokens; the parse tree may be constructed figuratively (by going through the corresponding derivation steps) or literally.

Context-Free Grammars. A grammar specifies a set of terminal symbols (inputs), another set of nonterminals (symbols representing syntactic constructs), and a set of productions, each of which gives a way in which strings represented by one nonterminal can be constructed from terminal symbols and strings represented by certain other nonterminals. A production consists of a head (the nonterminal to be replaced) and a body (the replacing string of grammar symbols).

Derivations. The process of starting with the start-nonterminal of a grammar and successively replacing it by the body of one of its productions is called a derivation. If the leftmost (or rightmost) nonterminal is always replaced, then the derivation is called leftmost (respectively, rightmost).
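
As an illustration (assuming the expr/term/factor grammar sketched in Section 4.2 above), the following Python fragment traces a leftmost derivation of id + id * id by always replacing the leftmost nonterminal:

    # A sketch: perform one leftmost-derivation step by replacing the leftmost
    # nonterminal with the body of a chosen production for it.
    def leftmost_step(form, head, body, nonterminals):
        for i, symbol in enumerate(form):
            if symbol in nonterminals:
                assert symbol == head, "must rewrite the leftmost nonterminal"
                return form[:i] + body + form[i + 1:]
        return form                       # no nonterminal left: a terminal string

    NONTERMINALS = {"expr", "term", "factor"}
    form = ["expr"]
    steps = [                             # the productions applied, in order
        ("expr",   ["expr", "+", "term"]),
        ("expr",   ["term"]),
        ("term",   ["factor"]),
        ("factor", ["id"]),
        ("term",   ["term", "*", "factor"]),
        ("term",   ["factor"]),
        ("factor", ["id"]),
        ("factor", ["id"]),
    ]
    for head, body in steps:
        form = leftmost_step(form, head, body, NONTERMINALS)
        print(" ".join(form))             # last line printed: id + id * id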

Parse Trees. A parse tree is a picture of a derivation, in which there is a node for each nonterminal that appears in the derivation. The children of a node are the symbols by which that nonterminal is replaced in the derivation. There is a one-to-one correspondence between parse trees, leftmost derivations, and rightmost derivations of the same terminal string.

Ambiguity. A grammar for which some terminal string has two or more different parse trees, or equivalently two or more leftmost derivations or two or more rightmost derivations, is said to be ambiguous. In most cases of practical interest, it is possible to redesign an ambiguous grammar so it becomes an unambiguous grammar for the same language. However, ambiguous grammars with certain tricks applied sometimes lead to more efficient parsers.
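
For example, with the ambiguous grammar E -> E + E | E * E | id, the string id + id * id has two parse trees; the sketch below (illustrative nested-tuple trees, not the book's representation) makes the two shapes explicit:

    # Illustration: two distinct parse trees for id + id * id, written as nested
    # tuples of the form (nonterminal, child, child, ...).
    tree_mul_high = ("E", ("E", "id"), "+", ("E", ("E", "id"), "*", ("E", "id")))  # id + (id * id)
    tree_add_high = ("E", ("E", ("E", "id"), "+", ("E", "id")), "*", ("E", "id"))  # (id + id) * id
    assert tree_mul_high != tree_add_high      # two trees, one terminal string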

Top-Down and Bottom-Up Parsing. Parsers are generally distinguished by whether they work top-down (start with the grammar’s start symbol and construct the parse tree from the top) or bottom-up (start with the terminal symbols that form the leaves of the parse tree and build the tree from the bottom). Top-down parsers include recursive-descent and LL parsers, while the most common forms of the bottom-up parsers are LR parsers.

Design of Grammars. Grammars suitable for top-down parsing often are harder to design than those used by bottom-up parsers. It is necessary to eliminate left-recursion, a situation where one nonterminal derives a string that begins with the same nonterminal. We also must left-factor – group productions for the same nonterminal that have a common prefix in the body.
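
The sketch below (an illustration with assumed names such as rest) shows the standard shapes of the two transformations applied to familiar productions:

    # Sketch of the two standard transformations (assumed shapes, not the book's code).
    # Eliminating immediate left recursion:  A -> A a | b  becomes
    #     A  -> b A'      A' -> a A' | epsilon
    # applied here to  E -> E + T | T:
    left_recursive = [("E", ["E", "+", "T"]), ("E", ["T"])]
    transformed = [
        ("E",  ["T", "E'"]),
        ("E'", ["+", "T", "E'"]),
        ("E'", []),                      # [] stands for the empty body (epsilon)
    ]

    # Left factoring: the two productions
    #     stmt -> if expr then stmt else stmt  |  if expr then stmt
    # share the prefix "if expr then stmt"; pulling it out gives (with a
    # hypothetical new nonterminal "rest"):
    factored = [
        ("stmt", ["if", "expr", "then", "stmt", "rest"]),
        ("rest", ["else", "stmt"]),
        ("rest", []),
    ]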

Recursive-Descent Parsers. These parsers use a procedure for each nonterminal. The procedure looks at its input and decides which production to apply for its nonterminal. Terminals in the body of the production are matched to the input at the appropriate time, while nonterminals in the body result in calls to their procedure. Backtracking, in the case when the wrong production was chosen, is a possibility.
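
A minimal sketch of such a parser in Python, for the non-left-recursive grammar E -> T E', E' -> + T E' | ε, T -> F T', T' -> * F T' | ε, F -> ( E ) | id; the class and token shapes are assumptions for illustration, and since this grammar is LL(1) no backtracking is needed:

    # One procedure (method) per nonterminal; terminals are matched against the input.
    class Parser:
        def __init__(self, tokens):
            self.tokens = tokens          # e.g. ["id", "+", "id", "*", "id"]
            self.pos = 0

        def lookahead(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else "$"

        def match(self, terminal):
            if self.lookahead() != terminal:
                raise SyntaxError(f"expected {terminal}, found {self.lookahead()}")
            self.pos += 1

        def E(self):                       # E -> T E'
            self.T(); self.E_prime()

        def E_prime(self):                 # E' -> + T E' | eps
            if self.lookahead() == "+":
                self.match("+"); self.T(); self.E_prime()

        def T(self):                       # T -> F T'
            self.F(); self.T_prime()

        def T_prime(self):                 # T' -> * F T' | eps
            if self.lookahead() == "*":
                self.match("*"); self.F(); self.T_prime()

        def F(self):                       # F -> ( E ) | id
            if self.lookahead() == "(":
                self.match("("); self.E(); self.match(")")
            else:
                self.match("id")

    Parser(["id", "+", "id", "*", "id"]).E()   # parses silently; raises SyntaxError on bad input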

LL(1) Parsers. A grammar such that it is possible to choose the correct production with which to expand a given nonterminal, looking only at the next input symbol, is called LL(1). These grammars allow us to construct a predictive parsing table that gives, for each nonterminal and each lookahead symbol, the correct choice of production. Error correction can be facilitated by placing error routines in some or all of the table entries that have no legitimate production.
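
A sketch of the corresponding table-driven parser for the same grammar as above; the table entries were written out by hand from FIRST and FOLLOW sets, and all names are illustrative:

    # Predictive parsing table: (nonterminal, lookahead) -> production body.
    TABLE = {
        ("E",  "id"): ["T", "E'"],  ("E",  "("): ["T", "E'"],
        ("E'", "+"):  ["+", "T", "E'"],
        ("E'", ")"):  [],           ("E'", "$"): [],          # epsilon entries
        ("T",  "id"): ["F", "T'"],  ("T",  "("): ["F", "T'"],
        ("T'", "*"):  ["*", "F", "T'"],
        ("T'", "+"):  [],           ("T'", ")"): [],  ("T'", "$"): [],
        ("F",  "id"): ["id"],       ("F",  "("): ["(", "E", ")"],
    }
    NONTERMINALS = {"E", "E'", "T", "T'", "F"}

    def predictive_parse(tokens):
        tokens = tokens + ["$"]
        stack, i = ["$", "E"], 0
        while stack:
            top = stack.pop()
            if top not in NONTERMINALS:            # terminal or end marker: must match input
                if top != tokens[i]:
                    raise SyntaxError(f"expected {top}, found {tokens[i]}")
                i += 1
            else:                                  # nonterminal: consult the table
                body = TABLE.get((top, tokens[i]))
                if body is None:
                    raise SyntaxError(f"no production for ({top}, {tokens[i]})")
                stack.extend(reversed(body))       # push body, leftmost symbol on top

    predictive_parse(["id", "+", "id", "*", "id"])  # accepts; raises SyntaxError otherwise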

Shift-Reduce Parsing. Bottom-up parsers generally operate by choosing, on the basis of the next input symbol (lookahead symbol) and the contents of the stack, whether to shift the next input onto the stack, or to reduce some symbols at the top of the stack. A reduce step takes a production body at the top of the stack and replaces it by the head of the production.
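
A hand-written trace (illustration only) of the shift and reduce moves for id * id with the grammar E -> E * E | id; each row lists the stack, the remaining input, and the action taken:

    trace = [
        ("$",        "id * id $", "shift"),
        ("$ id",     "* id $",    "reduce by E -> id"),
        ("$ E",      "* id $",    "shift"),
        ("$ E *",    "id $",      "shift"),
        ("$ E * id", "$",         "reduce by E -> id"),
        ("$ E * E",  "$",         "reduce by E -> E * E"),
        ("$ E",      "$",         "accept"),
    ]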

Viable Prefixes. In shift-reduce parsing, the stack contents are always a viable prefix – that is, a prefix of some right-sentential form that ends no further right than the end of the handle of that right-sentential form. The handle is the substring that was introduced in the last step of the rightmost derivation of that sentential form.

Valid Items. An item is a production with a dot somewhere in the body. An item is valid for a viable prefix if the production of that item is used to generate the handle, and the viable prefix includes all those symbols to the left of the dot, but not those to the right of the dot.

LR Parsers. Each of the several kinds of LR parsers operates by first constructing the sets of valid items (called LR states) for all possible viable prefixes, and keeping track of the state for each prefix on the stack. The set of valid items guides the shift-reduce parsing decision. We prefer to reduce if there is a valid item with the dot at the right end of the body, and we prefer to shift the lookahead symbol onto the stack if that symbol appears immediately to the right of the dot in some valid item.

Simple LR Parsers. In an SLR parser, we perform a reduction implied by a valid item with a dot at the right end, provided the lookahead symbol can follow the head of that production in some sentential form. The grammar is SLR, and this method can be applied, if there are no parsing-action conflicts; that is, for no set of items, and for no lookahead symbol, are there two productions to reduce by, nor is there the option to reduce or to shift.

Canonical-LR Parsers. This more complex form of LR parser uses items that are augmented by the set of lookahead symbols that can follow the use of the underlying production. Reductions are only chosen when there is a valid item with the dot at the right end, and the current lookahead symbol is one of those allowed for this item. A canonical-LR parser can avoid some of the parsing-action conflicts that are present in the SLR parsers, but often has many more states than the SLR parser for the same grammar.

Lookahead-LR Parsers. LALR parsers offer many of the advantages of SLR and Canonical-LR parsers, by combining the states that have the same kernels (sets of items, ignoring the associated lookahead sets). Thus, the number of states is the same as that of the SLR parser, but some parsing-action conflicts present in the SLR parser may be removed in the LALR parser. LALR parsers have become the method of choice in practice.

Bottom-Up Parsing of Ambiguous Grammars. In many important situations, such as parsing arithmetic expressions, we can use an ambiguous grammar, and exploit side information such as the precedence of operators to resolve conflicts between shifting and reducing, or between reduction by two different productions. Thus, LR parsing techniques extend to many ambiguous grammars.

Yacc. The parser-generator Yacc takes a (possibly) ambiguous grammar and conflict-resolution information and constructs the LALR states. It then produces a function that uses these states to perform a bottom-up parse and call an associated function each time a reduction is performed.

 

 

Summary of Chapter 3

LEXICAL ANALYSIS

Tokens. The lexical analyzer scans the source program and produces as output a sequence of tokens, which are normally passed, one at a time, to the parser. Some tokens may consist only of a token name while others may also have an associated lexical value that gives information about the particular instance of the token that has been found on the input.

Lexemes. Each time the lexical analyzer returns a token to the parser, it has an associated lexeme – the sequence of input characters that the token represents.

Buffering. Because it is often necessary to scan ahead on the input in order to see where the next lexeme ends, it is usually necessary for the lexical analyzer to buffer its input. Using a pair of buffers cyclically and ending each buffer’s contents with a sentinel that warns of its end are two techniques that accelerate the process of scanning the input.

Patterns. Each token has a pattern that describes which sequences of characters can form the lexemes corresponding to that token. The set of words, or strings of characters, that match a given pattern is called a language.

Regular Expressions. These expressions are commonly used to describe patterns. Regular expressions are built from single characters, using union, concatenation, and the Kleene closure, or any-number-of, operator.

Regular Definitions. Complex collections of languages, such as the patterns that describe the tokens of a programming language, are often defined by a regular definition, which is a sequence of statements that each define one variable to stand for some regular expression. The regular expression for one variable can use previously defined variables in its regular expression.

Extended Regular-Expression Notation. A number of additional operators may appear as shorthands in regular expressions, to make it easier to express patterns. Examples include the + operator (one-or-more-of), ? (zero-or-one-of), and character classes (the union of the strings each consisting of one of the characters).
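
As an illustration (using Python's re module rather than Lex syntax), the shorthands appear in two common token patterns; the pattern names are assumptions:

    import re

    # + (one or more), ? (zero or one), and character classes in token patterns.
    ID_PATTERN  = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")                # letter, then letters/digits
    NUM_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?")   # digits, optional fraction and exponent

    assert ID_PATTERN.fullmatch("peek")
    assert NUM_PATTERN.fullmatch("6.02E23")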

Transition Diagrams. The behavior of a lexical analyzer can often be described by a transition diagram. These diagrams have states, each of which represents something about the history of the characters seen during the current search for a lexeme that matches one of the possible patterns. There are arrows, or transitions, from one state to another, each of which indicates the possible next input characters that cause the lexical analyzer to make that change of state.

Finite Automata. These are a formalization of transition diagrams that include a designation of a start state and one or more accepting states, as well as the set of states, input characters, and transitions among states. Accepting states indicate that the lexeme for some token has been found. Unlike transition diagrams, finite automata can make transitions on empty input as well as on input characters.

Deterministic Finite Automata. A DFA is a special kind of finite automaton that has exactly one transition out of each state for each input symbol. Also, transitions on empty input are disallowed. The DFA is easily simulated and makes a good implementation of a lexical analyzer, similar to a transition diagram.
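
A minimal sketch of such a simulation in Python; the transition table below is an illustrative DFA that recognizes identifiers (a letter followed by letters or digits):

    # Classify each input character, then follow the single transition per state.
    def char_class(c):
        return "letter" if c.isalpha() or c == "_" else "digit" if c.isdigit() else "other"

    TRANSITIONS = {            # (state, input class) -> next state
        (0, "letter"): 1,
        (1, "letter"): 1,
        (1, "digit"):  1,
    }
    ACCEPTING = {1}

    def dfa_accepts(string, start=0):
        state = start
        for c in string:
            state = TRANSITIONS.get((state, char_class(c)))
            if state is None:          # no transition: the DFA rejects
                return False
        return state in ACCEPTING

    assert dfa_accepts("x1") and not dfa_accepts("1x")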

Nondeterministic Finite Automata. Automata that are not DFA’s are called nondeterministic. NFA’s often are easier to design than are DFA’s. Another possible architecture for a lexical analyzer is to tabulate all the states that NFA’s for each of the possible patterns can be in, as we scan the input characters.
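
A sketch of that set-of-states simulation for a single NFA (epsilon-transitions omitted for brevity); the automaton below recognizes (a|b)*abb, and the names are illustrative:

    NFA = {                      # (state, symbol) -> set of possible next states
        (0, "a"): {0, 1},
        (0, "b"): {0},
        (1, "b"): {2},
        (2, "b"): {3},
    }
    START, ACCEPTING = {0}, {3}

    def nfa_accepts(string):
        states = set(START)
        for c in string:        # advance every state the NFA could currently be in
            states = set().union(*(NFA.get((s, c), set()) for s in states))
        return bool(states & ACCEPTING)

    assert nfa_accepts("babb") and not nfa_accepts("abab")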

Conversion Among Pattern Representations. It is possible to convert any regular expression into an NFA of about the same size, recognizing the same language as the regular expression defines. Further, any NFA can be converted to a DFA for the same pattern, although in the worst case (never encountered in common programming languages) the size of the automaton can grow exponentially. It is also possible to convert any nondeterministic or deterministic finite automaton into a regular expression that defines the same language recognized by the finite automaton.

Lex. There is a family of software systems, including Lex and Flex, that are lexical-analyzer generators. The user specifies the patterns for tokens using an extended regular-expression notation. Lex converts these expressions into a lexical analyzer that is essentially a deterministic finite automaton that recognizes any of the patterns.

Minimization of Finite Automata. For every DFA there is a minimum-state DFA accepting the same language. Moreover, the minimum-state DFA for a given language is unique except for the names given to the various states. 

Summary of Chapter 2

A SIMPLE SYNTAX-DIRECTED TRANSLATION

The syntax-directed techniques in this chapter can be used to construct compiler front ends, such as those illustrated in the following figure:

[Figure: translations of a statement]

The starting point for a syntax-directed translator is a grammar for the source language. A grammar describes the hierarchical structure of programs. It is defined in terms of elementary symbols called terminals and variable symbols called nonterminals. These symbols represent language constructs. The rules or productions of a grammar consist of a nonterminal called the head or left side of a production and a sequence of terminals and nonterminals called the body or right side of the production. One nonterminal is designated as the start symbol.

In specifying a translator, it is helpful to attach attributes to programming constructs, where an attribute is any quantity associated with a construct. Since constructs are represented by grammar symbols, the concept of attributes extends to grammar symbols. Examples of attributes include an integer value associated with a terminal num representing numbers, and a string associated with a terminal id representing identifiers.

A lexical analyzer reads the input one character at a time and produces as output a stream of tokens, where a token consists of a terminal symbol along with additional information in the form of attribute values. In the figure being presented, tokens are written as tuples enclosed between <>. The token <id, “peek”> consists of the terminal id and a pointer to the symbol-table entry containing the string “peek”. The translator uses the table to keep track of reserved words and identifiers that have already been seen.

Parsing is the problem of figuring out how a string of terminals can be derived from the start symbol of the grammar by repeatedly replacing a nonterminal by the body of one of its productions. Conceptually, a parser builds a parse tree in which the root is labeled with the start symbol, each nonleaf corresponds to a production, and each leaf is labeled with a terminal or the empty string ε. The parse tree derives the string of terminals at the leaves, read from left to right.

Efficient parsers can be built by hand, using a top-down (from the root to the leaves of a parse tree) method called predictive parsing. A predictive parser has a procedure for each nonterminal; procedure bodies mimic the productions for nonterminals; and, the flow of control through the procedure bodies can be determined unambiguously by looking one symbol ahead in the input stream.

Syntax-directed translation is done by attaching either rules or program fragments to productions in a grammar. In this chapter, we have considered only synthesized attributes – the value of a synthesized attribute at any node x can depend only on attributes at the children of x, if any. A syntax-directed definition attaches rules to productions; the rules compute attribute values. A translation scheme embeds program fragments called semantic actions in production bodies. The actions are executed in the order that productions are used during syntax analysis.
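
A sketch in the spirit of the chapter's infix-to-postfix example (names and token handling are assumptions): semantic actions, here the appends to the output list, are placed in the production bodies and executed as the predictive parser uses the productions:

    # Translation scheme (actions in braces):
    #   expr -> term rest
    #   rest -> + term { print('+') } rest  |  - term { print('-') } rest  |  eps
    #   term -> digit { print(digit) }
    def translate(s):
        pos = 0
        out = []

        def term():
            nonlocal pos
            if pos < len(s) and s[pos].isdigit():
                out.append(s[pos]); pos += 1        # action: emit the digit
            else:
                raise SyntaxError("digit expected")

        def rest():
            nonlocal pos
            while pos < len(s) and s[pos] in "+-":
                op = s[pos]; pos += 1
                term()
                out.append(op)                      # action: emit the operator after its operands

        term(); rest()
        return " ".join(out)

    assert translate("9-5+2") == "9 5 - 2 +"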

The result of syntax analysis is a representation of the source program, called intermediate code. Two primary forms of intermediate code are illustrated in the figure above. An abstract syntax tree has nodes for programming constructs; the children of a node give the meaningful subconstructs. Alternatively, three-address code is a sequence of instructions in which each instruction carries out a single operation.
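
As a small illustration (not the book's data structures), the assignment i = i + 1 in the two forms:

    # Abstract syntax tree as nested tuples: one node per construct.
    ast = ("assign", "i", ("+", "i", ("num", 1)))

    # Equivalent three-address code: one operation per instruction.
    three_address_code = [
        "t1 = i + 1",
        "i = t1",
    ]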

Symbol tables are data structures that hold information about identifiers. Information is put into the symbol table when the declaration of an identifier is analyzed. A semantic action gets information from the symbol table when the identifier is subsequently used, for example, as a factor in an expression.

Summary of Chapter 1

Language Processors. An integrated software development environment includes many different kinds of language processors, such as compilers, interpreters, assemblers, linkers, loaders, debuggers, and profilers.

Compiler Phases. A compiler operates as a sequence of phases, each of which transforms the source program from one intermediate representation to another.

Machine and Assembly Languages. Machine languages were the first-generation programming languages, followed by assembly languages. Programming in these languages was time consuming and error prone.

Modeling in Compiler Design. Compiler design is one of the places where theory has had the most impact on practice. Models that have been found useful include automata, grammars, regular expressions, trees, and many others.

Code Optimization. Although code cannot truly be “optimized,” the science of improving the efficiency of code is both complex and very important. It is a major portion of the study of compilation.

Higher-Level Languages. As time goes on, programming languages take on progressively more of the tasks that formerly were left to the programmer, such as memory management, type-consistency checking, or parallel execution of code.

Compilers and Computer Architecture. Compiler technology influences computer architecture, as well as being influenced by the advances in architecture. Many modern innovations in architecture depend on compilers being able to extract from source programs the opportunities to use the hardware capability effectively.

Software Productivity and Software Security. The same technology that allows compilers to optimize code can be used for a variety of program-analysis tasks, ranging from detecting common program bugs to discovering that a program is vulnerable to one of the many kinds of intrusions that “hackers” have discovered.

Scope Rules. The scope of a declaration of x is the context in which uses of x refer to this declaration. A language uses static scope or lexical scope if it is possible to determine the scope of a declaration by looking only at the program. Otherwise, the language uses dynamic scope.

Environments. The association of names with locations in memory and then with values can be described in terms of environments, which map names to locations in the store, and states, which map locations to their values.

Block Structure. Languages that allow blocks to be nested are said to have block structure. A name x in a nested block B is in the scope of a declaration D of x in an enclosing block if there is no other declaration of x in an intervening block.

Parameter Passing. Parameters are passed from a calling procedure to the callee either by value or by reference. When large objects are passed by value, the values passed are really references to the objects themselves, resulting in an effective call-by-reference.

Aliasing. When parameters are (effectively) passed by reference, two formal parameters can refer to the same object. This possibility allows a change in one variable to change another.