"Linux Gazette...making Linux just a little more fun!"

Compiler Construction Tools

By Richard A. Sevenich

Part I:  - JFlex and CUP

by Richard A. Sevenich, Department of Computer Science, Eastern Washington University
March 25, 1999

Traditionally, in the UNIX world, there are two complementary compiler construction tools which are available:

These tools are freely available in the GNU/Linux world, usually free, in some cases licensed under the GPL. It should be pointed out that lexical and syntactic analysis are two of the primary jobs to be performed by a language translator. A lexer and parser built with the above tools do not automatically accomplish a third crucial task, target code generation. However, these tools provide the programmer with convenient 'hooks' for incorporating target code generation.

Later in this series of articles I hope to introduce two of these tools, JFlex and CUP, in a tutorial setting. I will enlist the help of several students in my compiler design course as coauthors. Ultimately, I hope to persuade those unfamiliar with these tools that they are very practical. I've chosen JFlex and CUP because they produce java code and it is high time for me to learn some java. This unfamiliarity with java will also provide me with a scapegoat when I do something stupid - I can blame it on that unfamiliarity.

This first article will provide background for the series. The next article, which should follow fairly soon, will show how/where to get these tools, give a very specific installation scenario, and produce a simple application as a development example. A third article will give a practical real world example (to be described below). If the third article becomes unwieldy, it may be broken  into parts.

The Lexical Analyzer (a.k.a. 'lexer' or 'scanner')

Language translation converts source code from some language into target code in some other language. The 'traditional' compiler may convert source code into assembly language or even machine code - although the later articles in this series will focus on other targets than these. The first task of language translation is akin to examining an English essay to make sure that the words are spelled correctly. The lexer performs this job on our source code by recognizing a series of contiguous symbols as valid or not e.g. the lexer might The lexer is analogous to a spelling checker for a source program.

A utility such as JFlex builds a lexer from a specification file the programmer writes to define the 'words' (lexical tokens) in the desired language. Let's say the programmer defines a new langauge called pronto and writes a file 'pronto.flex' which defines valid lexical tokens for pronto. Then the command line operation 'JFlex pronto.flex' will produce a java version of a lexical analyzer, say, "Lexer.java'.

In its most primitive deployment, the lexer merely indicates that the source file consists of all valid lexical tokens or not - a boolean yes or no. The family of lexers under discussion are built to do more and, in particular, can cooperate with the parser (to be discussed under the next heading). In the typical application the lexer is, in fact, called by the parser and the lexer can do these jobs:

The first of the three items above allows the lexer to support the parser's central task, syntactic analysis. The other two items are especially useful in helping the parser in ultimately generating target code.

The Syntactic Analyzer (a.k.a. the 'parser')

Continuing the analogy that began the previous section, just as the lexer checks words for spelling, the parser examines the source to assure that the 'words' are arranged in valid grammatical constructs. For example, for some programmer's new language the lexer might pass these six valid tokens to the parser:
{  if  + while ] } - the lexer only worries about token validity, not the arrangement.
However, the parser sees the arrangement '{ if + while ] }' as invalid. Just as the lexer is a source code spelling checker, the parser is a grammar checker.

[ Note: The compilerati will realize that the lexer is actually checking the source code validity against a very simple 'regular grammar' specification and that the parser is checking the source code against a less simple 'context free grammar' specification - as defined in the Chomsky hierarchy. Typical compiler design books discuss the theory at length.]

A utility such as CUP builds a parser from a specification file the programmer writes to define the syntactic structure in the desired language. For the fictitious new language called pronto, the programmer might write the specification file as 'pronto.cup'.  Then the command line operation 'java java_cup.Main < pronto.cup' will produce several files one of which is a java version of the desired syntactic analyzer, "parser.java'.

In its most primitive deployment, the parser indicates that the source file is either grammatically correct or not - again, a boolean yes or no. The family of parsers under discussion can do an additional task - whenever a valid grammatical construct is found, perform any code that the programmer cares to encode. This 'hook' is typically used to support target code generation in two execution styles:
generated code written to a file to be executed later
generated code to be executed while the parser operates

Application Specific Languages

The compiler construction tools under discussion can be used to develop a full-blown language translator e.g. for C, Pascal, FORTRAN, Perl, etc. These would comprise major development projects. Here I'd like to discuss translators for 'Application Specific Languages', typically a more modest project, yet quite useful. I'll define an 'Application Specific Language' (ASL) operationally, by means of two examples.

Example 1 - A generalized industrial control language

Let's say Fred works for a company that produces industrial controllers, which are driven by a computer. When Fred is hired, the company has already developed and deployed a powerful, general pupose control language based on generalized, parallel state machines. However, customers must become programmers to use the controller; customers formally trained as chemical engineers, mechanical engineers, technicians etc. with little desire or time to learn a new general purpose programming language. The product is very general pupose, useful in many niche industries - automotive, petroleum, logging mills, satellite control, etc.

Fred has been hired to put a front end on the language for every one of the exploitable niche markets. In each case, the front end is to be tailored to the terminology used by the niche market customer and to be easy to use. The front end might be of the 'fill in the blanks' variety, a GUI, whatever. The front ends are really new languages all with the same target language (the general purpose control langauge). Each front end exemplifies an ASL.  By using the compiler construction tools (e.g. JFlex and CUP), Fred capitalizes on several benefits including:

Example 2 -  A generalized Fuzzy Logic based decision package

Fuzzy Logic has proved useful, not only in its traditional role in industrial control, but also in a decision making role. It can be used to play the stock market (and lose money more elegantly?), to choose from among a corporation's research or marketing strategies, to aid in avalanche prediction, etc.

Let's suppose that Fred's significant other, Sally, has programmed a general pupose Fuzzy Logic decision maker. To apply it to different problems it is initialized from appropriately different initialization data files. Sally markets this product to various niches, but finds former customers a constant burden on her time. The problem is really inherent in the way Fuzzy Logic works. The customer is the expert in his/her particular problem space.  A Fuzzy Logic model is initialized by incorporating the expertise of the user. The user gains more expertise as the model is used and will constantly want to tweak the model. The crux of Sally's problem is that the initialization file must be prepared with great programming care. The customers are not programmers and easily make mistakes (most likely syntactic) in preparing such a data file. They always run into problems, call on Sally for help, and expect her to do it at a fairly low margin. She must respond to maintain her reputation and, hence, her financila success.

Her solution is to make an 'idiot-proof' front end that will automate writing the initialization data file. The front end is tailored to the niche customer's terminology and problem space. The front end is an ASL with the initialization data file as the target language. The translator can be written with the help of the compiler construction tools, just as done by Fred for the industrial control scenario. The translator guarantees that the lexical and syntactic content of the data file is correct.

If the front end is cleverly defined the customer will find it useful. Note that the customer is an expert on the problem semantics, where programmer Sally would be weakest. The customer will solve the semantic problems, and the ASL translator will avoid the lexical and syntactic problems related to the initialization data file.


The two preceding examples have an obvious common thread. It should be emphasized that in designing the front end ASL's, Fred and Sally must interact strongly with the niche customers. After all, the ASL is to be useful to a programming novice who, at the same time, has expertise in the problem space. If this interaction fails, the ASL will not be well received and may fail in its intended market.

The Fuzzy Logic example is, in fact, the course project for this quarter's compiler design class - assuming Sally doesn't beat us to it.

Copyright © 1999, Richard A. Sevenich
Published in Issue 39 of Linux Gazette, April 1999