September 11, 2007 (Lecture 5)

Credits

Much of today's lecture, including the examples, is taken from Chapter 3 of the O'Reilly lex & yacc book. It is a great reference -- with 5 other great chapters:
Levine, John, Mason, Tony, and Brown, Doug, lex & yacc, O'Reilly, 1995.

Processing Language

It is often the case that computer programs need to interpret structured input. This input is provided as a collection of elements organized in conformance with a language.

The process of recognizing what is communicated by this type of input involves recognizing the directions provided by the input and taking the appropriate action.

This process often takes the form of two basic steps: lexical analysis and parsing. Lexical analysis is the process of recognizing the important elements of the language within the stream of input. These elements are often called tokens. Parsing is the process of determining the relationships among these tokens, in context, and initiating the appropriate actions.

In the UNIX environment, there are two "old standard" tools for this process: lex and yacc (or GNU's mostly compatible flex and bison). Today we are going to learn how to use lex to identify the tokens from among the stream of input. Next class we'll learn how to tie this to yacc in order to interpret the tokens in context and initiate the corresponding actions.

An Overview of Lex

Lex is a tool that takes a specification file as input and generates a program in accordance with that specification. I'll sometimes refer to the lex specification as a "lex program". I'll call the program that it generates a lexer, or analyzer, for short.

Traditionally, the output of lex is the C language source code for the lexer. This program is then compiled to produce an executable. And, this is, in fact, what we'll do. But, I do want to note that lex-inspired programs are now available for several different host languages.

So, the basic process for developing software will be this. We'll craft a definition file that describes, using regular expressions, the tokens that should be observed among the stream of input. Corresponding to each of these tokens, we'll provide segments of C code, or lex macros, that should be executed when the token is recognized.

It is often the case that a lexer developed using lex will be used by a parser developed using yacc, the subject of Thursday's conversation. In this model, the parser will ask the lexer for the next token, interpret it, taking any requisite actions, and ask for another one.

It should also be noted that, when the analyzer is generated, it is placed within a function called yylex(). This can be called from a main() method and run until completion, or be designed to return one token at a time and, for example, be used to feed a parser. Other configurations are possible and commonplace.

A Lex Specification

A lex specification consists of three sections: definitions, rules, and subroutines. These sections are separated within the specification by the delimiter %%:

definition section

%%

rule section

%%

user subroutines

Of these sections, the definitions and user subroutines sections are optional. As a result, a minimal specification is structured as below. Note the need for the initial %% to separate the empty definitions section from the rules section -- also note the absence of the delimiter after the rules section and before the empty user subroutines section:

%%

rule section

Lex rules

Since the most minimal lex specification contains only a rules section, which might contain only a single rule (I actually can't tell you if no rules are allowed -- I've never had any reason to use lex as a pass-through filter), let's begin by discussing rules.

A rule has two parts: the pattern and the action. The pattern is nothing more than a regular expression. The action is nothing more than a snippet of C code, or a lex macro (much like a C language macro), that is executed when a match is found for the regular expression.

Basically, the lexer processes the input stream searching, from beginning to end, for matches. Each time a match is found, the corresponding action is taken.

If more than one pattern matches the same input, only one of the corresponding actions will be taken: the one corresponding to the longest match. In the event of a tie by length, the rule appearing first within the lex specification is selected.
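
As a small illustration (this sketch isn't from the book), consider a lexer with both a keyword pattern and an identifier pattern:

  %%
  "for"           { printf ("keyword\n"); }
  [a-z]+          { printf ("identifier\n"); }

Given the input "format", the second rule wins, because [a-z]+ matches six characters while "for" matches only three. Given the input "for" alone, both patterns match the same three characters, so the tie goes to the rule listed first, and "keyword" is printed.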

The following is a trivial, but perhaps useful, lexer. It recognizes carriage returns and replaces them with new lines.


  %%
  "\r"	printf "\n"
  

Lex definitions and Subroutines

The definitions section of a lex specification is used to make definitions that are required by the rest of the lexer. These definitions might be code snippets in the host language, or lex macros. Let's consider the following example, which counts the words and lines in text:


%{
  int wordCount = 0;
  int lineCount = 0;
%}

word    [^ \.,;:<>\t\n\r]+
eol     \n

%%

{word}          { wordCount++; }
{eol}           { lineCount++; }

%%

  int main(int argc, char *argv[]) {
    yylex();
    printf ("%d words, %d lines\n", wordCount, lineCount);
  }

  

The example above illustrates the use of both definitions and subroutines. The definitions section shows both the inclusion of C code and the use of lex macros.

Notice that the C language code is contained within a %{ ... %} block. Notice also that we used macros to define a word and the end of line -- this made these easier to use later. It would also make them easier to use and re-use in bigger examples.

The main() method, which will become part of the final executable, was defined within the subroutines section. The subroutines here can also be helper methods to be called by the actions.

Notice the use of yylex(). This is the method that encapsulates the analyzer, itself. Another method that might be of interest is yywrap(). It is called when the lexer hits the end of a file.

yywrap() is predefined to return 1 -- indicating that lex should stop scanning. But, it can be redefined to do other things first, such as opening the next input file and returning 0, which tells lex to continue scanning.
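
Here's a minimal sketch of that idiom. The input file name is made up, but the structure is the standard one: point yyin at the new input and return 0 to keep scanning, or return 1 to really stop:

  %%
  .|\n    ECHO;
  %%
  int yywrap () {
    static int switched = 0;

    if (!switched) {
      switched = 1;
      yyin = fopen ("more-input.txt", "r");  /* made-up file name */

      if (yyin != NULL)
        return 0;  /* new input is ready -- keep scanning */
    }

    return 1;  /* no more input -- stop */
  }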

Some versions of lex define yywrap() as a macro. This isn't done by the oldest versions or the most current ones, but it is by some in between. Regardless, if your lex hassles you over this, the fix is to undefine it:

  
%{
#undef yywrap
%}

Below is one more example. It recognizes and prints words and tags. What to note? Take a look at how yylex() is used to get one token at a time. Look also at how the predefined global variable yytext is used to get at the value of the recognized token.

  
%{
#define WORD 1
#define TAG  2
%}

word    [^ \.,;:<>\t\n\r]+
tag     <.*>

%%

{word}          { return WORD; }
{tag}           { return TAG; }

%%

int main (int argc, char *argv[]) {
  int type;

  while ( (type = yylex()) ) {
    if (WORD == type)
      printf ("Word: %s\n", yytext);
    if (TAG == type)
      printf ("Tag: %s\n", yytext);
  }
}

Running Lex and Compiling

Let's take a quick look at how to make the tools go:

  
lex wordcount.l     # generates lex.yy.c
gcc lex.yy.c -ll    # compiles to a.out; note the -ll to link in the lex library
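
If you're using flex rather than lex, the steps are the same, except that flex's support library is linked with -lfl:

flex wordcount.l    # also generates lex.yy.c
gcc lex.yy.c -lfl   # flex's library is -lfl rather than -ll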

Using yacc

yacc, yet another compiler compiler, is a tool for writing a parser. In a typical case, lex tokenizes the input and yacc parses the tokens, taking the right actions, in context.

yacc can't parse every language -- no parser can. And, the technique it uses is far from the most powerful known to computer science. But, yacc can parse LALR(1) languages -- a collection of languages sufficient for the overwhelming majority of the needs we have in programming and other human-computer interface languages. By limiting its scope to this class of language, yacc is able to generate smaller tables and operate more quickly.

Because yacc is an LALR(1) parser, we'll need to represent our language through rules that can be parsed from left-to-right without the need to look more than one token ahead of the present token.

In the overwhelming majority of cases, this presents no problem -- most languages we'll try to create are LALR(1). In other cases, yacc provides the ability to add precedence rules that can let it work around a few somewhat routine types of sticky spots. But, it is possible to cook up a language that yacc can't handle. And, more commonly, we might have to think about how to represent some languages in a way that it can.

The Structure of a yacc Parser

The structure of a yacc parser definition is much like that of a lex definition. In fact, the format of the lex specification was based on yacc's. The bottom line is that it has the same three basic sections, verbatim: definitions, rules, and subroutines.

A Simple Example

Below is a simple example that includes the definition of two tokens and the rules of a grammar.

The token definitions get placed into a file called y.tab.h, which can be #included into lex's definitions section.

  
%token NAME NUMBER

%%

statement:  NAME '=' expression
         |  expression             { printf ("=%d\n", $1); }
         ;

expression: expression '+' NUMBER  { $$ = $1 + $3; }
          | expression '-' NUMBER  { $$ = $1 - $3; }
          | NUMBER                 { $$ = $1; }
          ;

expression1.y


$$ represents the value of the left-hand side. $1 through $n represent the values of the symbols on the right-hand side. We set the value of $$ within yacc. The values of the tokens ($1 through $n) are set in lex by assigning a value to yylval before returning.
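
As a sketch, here is what a companion lex specification for this grammar might look like. It assumes the default arrangement, in which yylval is an int. It assigns yylval before returning NUMBER, and returns single characters, such as '=' and '+', as themselves:

%{
#include <stdlib.h>    /* for atoi() */
#include "y.tab.h"     /* NAME and NUMBER, as defined by yacc */
extern int yylval;
%}

%%

[0-9]+          { yylval = atoi (yytext); return NUMBER; }
[a-zA-Z]+       { return NAME; }
[ \t]           ;  /* skip whitespace */
\n              { return 0; }  /* end of input */
.               { return yytext[0]; }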

An Ambiguity

Let's consider the following example. Is a horse a work_animal, or a cart_animal?
  
phrase:      cart_animal AND CART
      |      work_animal AND PLOW

cart_animal: HORSE
           | GOAT

work_animal: HORSE
           | OX

animals-broken.y


Well, let's construct two examples:

  HORSE AND CART
  HORSE AND PLOW

Given the whole of either example, we can figure it out. If it involves a PLOW, it is a work_animal. If it involves a CART, it is a cart_animal.

But, the problem for yacc is that, at the time it sees HORSE, it sees two different reductions: to work_animal and to cart_animal. So, it looks ahead by one token -- to AND. But, this doesn't help. Either can still work. So, it is stuck. If it could look ahead just one more, to CART or PLOW, it could figure it out -- but that would be LALR(2), not LALR(1). yacc can't handle this -- it is a reduce/reduce conflict.
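
You don't have to take my word for it. yacc's -v option writes a description of the generated automaton, including its conflicts, to a file named y.output:

yacc -v animals-broken.y    # details of the reduce/reduce conflict land in y.output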

A Corrected Representation

After a few minutes of thought and experimentation, the class cooked up a new representation of the language. This time, it is LALR(1) and parseable by yacc.

  
phrase:      cart_animal CART
      |      work_animal PLOW
      ;

cart_animal: HORSE AND
           | GOAT AND

work_animal: HORSE AND
           | OX AND

animals-fixed.y


By moving the token AND from the "phrase" reduction to each of the "cart_animal" and "work_animal" reductions, we are able to delay the choice between those reductions -- until one token of look-ahead is sufficient.

Precedence

Consider the grammar below. It represents simple mathematical expressions. It gets things right, almost. Do you see the detail it isn't capturing?

  
%token NUMBER

%%

expression: expression '+' mulexp
          | expression '-' mulexp
          | mulexp
          ;

mulexp:     mulexp '*' primary
          | mulexp '/' primary
          | primary
          ;

primary:    '(' expression ')'
          | '-' primary
          | '+' primary
          | NUMBER
          ;

expression-broken.y

The problem is that, without help, yacc doesn't capture the associativity or the precedence of the operators. We generally expect * and / to be evaluated before + and -. Fortunately, yacc does provide a way to set associativity and precedence. It does this with three directives, which are placed in the definitions section:

  %left
  %right
  %nonassoc

These directives can be used to dictate the associativity (left or right) of an operator. And, by listing them in order, they also define the precedence of the operators. Reread the last sentence: order matters! Those operators listed first have a lower precedence.

For an example, here's a fixed version of the grammar above. Notice that + and - are listed before * and /. The third directive, %nonassoc, is used for unary operators. It indicates neither left nor right associativity, but is used to assign the operator to the right precedence level. By assigning the unary - (negative) operator a higher precedence, we can ensure that a number's negative magnitude is known before it is evaluated as part of an expression -- see the sketch after the grammar below.

  
%token NUMBER %left '+' '-' %left '*' '/' %% expression: expression '+' mulexp | expression '-' mulexp | mulexp ; mulexp: mulexp '*' primary | mulexp '/' primary | primary ; primary: '(' expression ')' | '-' primary | '+' primary | NUMBER ; expression-fixed.y