As I said, in this tutorial, we will build up a parser for four-function arithmetic, but we are going to start with a subset. We'll just handle addition for now -- the addition of positive decimal numbers to be exact. Our first parser will be able to handle:
- A decimal number, such as 79, 3.14, or 0.78
- Two or more decimal numbers added together, such as 2+2 or 0 + 3.14 + 2.718
Those of you who are parser generation veterans will know what is needed here. Namely: on the lexical level, we need to define two kinds of token: the plus operator, and a number token described by a regular expression that matches decimal numbers. On the syntactical level, we need a production that represents a sequence of one or more decimal numbers separated by + signs. Now, I am perfectly aware that for people without that background, the above two sentences will be gobbledygook. I certainly am not assuming that all readers know what is meant by the terms token, regular expression, or production. And I realize that the distinction between the lexical and the syntactical may be fuzzy in your mind at best. However, I see little point in defining these things at this point. The best way to demystify all of this is to jump into some actual code.
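As a first taste of the two levels, here is a hand-written sketch in plain Java. It is not FreeCC grammar syntax and not what FreeCC generates; the class name AdditionSketch, the method names, and the exact number pattern (digits with an optional fractional part) are all assumptions made for illustration. The tokenize method plays the lexical role, and isAdditiveExpression plays the syntactical role.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AdditionSketch {
    // Lexical level: whitespace to skip, plus two kinds of token.
    // The NUMBER pattern is an assumption: digits, optionally followed by a fraction.
    private static final Pattern TOKEN = Pattern.compile(
        "\\s+|(?<NUMBER>[0-9]+(\\.[0-9]+)?)|(?<PLUS>\\+)");

    // Split the input into token strings, dropping whitespace along the way.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at offset " + pos);
            }
            // Keep NUMBER and PLUS matches; a whitespace match sets neither group.
            if (m.group("NUMBER") != null || m.group("PLUS") != null) {
                tokens.add(m.group());
            }
            pos = m.end();
        }
        return tokens;
    }

    // Syntactical level: a number, followed by zero or more (plus, number) pairs.
    static boolean isAdditiveExpression(List<String> tokens) {
        if (tokens.size() % 2 == 0) {
            return false; // empty input, or a dangling operator or operand
        }
        for (int i = 0; i < tokens.size(); i++) {
            boolean expectNumber = (i % 2 == 0);
            boolean isPlus = tokens.get(i).equals("+");
            if (expectNumber == isPlus) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("0 + 3.14 + 2.718"));               // [0, +, 3.14, +, 2.718]
        System.out.println(isAdditiveExpression(tokenize("2+2")));      // true
        System.out.println(isAdditiveExpression(tokenize("2 + + 2")));  // false
    }
}
```

The point of a parser generator is that you do not write this kind of code by hand; you describe the tokens and the production, and the tool produces the Java for you.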
In your text editor, create a new file and insert the following code:
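[The grammar listing was not preserved in this copy. As described below, it contains a SKIP specification for whitespace, the PLUS and NUMBER token definitions, and the AdditiveExpression production.]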
Even if you have never written a grammar before, it should be fairly obvious which parts of the grammar correspond to the various elements described above. PLUS and NUMBER are the two token types. The strange looking text that follows NUMBER is the so-called regular expression that matches decimal numbers. Further down, AdditiveExpression is the syntactical production that matches one or more numbers with intervening plus signs. Note, by the way, that we also have a specification starting with SKIP. That says that whitespace characters should be skipped. This allows the parser to handle 2 +2 or 2+ 2, for example, since it ignores the spaces in the input (as well as tabs, carriage returns, and linefeeds).
Okay, without further ado, save the file as Arithmetic.freecc, open a command line, and cd into the directory where you saved the file. Now run:
freecc Arithmetic.freecc
This command should have generated a bunch of files in the current working directory. If it didn't, either a typo slipped into your Arithmetic.freecc file or your environment is not set up as indicated, i.e. the java, javac, and freecc programs are not all on your system PATH.
Anyway, assuming the above command ran without error, you now have a whole bunch of generated Java code in that directory: about 2000 lines of code in 13 files. As you can see, FreeCC generates quite a bit of boilerplate for even the most minimal grammar file. If you haven't done so already (for many, it is surely a reflex), compile the code with:
javac *.java
Now, I would hope that you are at least a little bit curious about all this code that the little grammar file above is responsible for generating. While there's nothing wrong with simply bringing it up in your editor and looking at it, probably a better way to get an initial bird's eye view is to use the javadoc tool to generate some navigable API documentation. So, let's do that. On the command-line, enter:
javadoc *.java
And let's have a look. If we open package-tree.html in a web browser, we can get a view of the various classes that FreeCC has generated. As foreshadowing for the next section, where we write some code that uses the parser, let me point out the classes we are going to interact with. The class that is the actual parser is called, not surprisingly, ArithmeticParser, and lives in the file ArithmeticParser.java. We will also use the classes named for the elements in our grammar: NUMBER, PLUS, and AdditiveExpression. Note that these three classes all implement the Node interface defined in Node.java. AdditiveExpression implements Node because it is a subclass of BaseNode, which in turn implements Node. NUMBER and PLUS are subclasses of Token, which also implements the Node interface.
These various generated Node objects -- specifically in this case NUMBER, PLUS, and AdditiveExpression -- are the building blocks from which our generated parser will construct an AST (abstract syntax tree) which represents the parsed input.
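To picture how these classes relate, here is a simplified, hand-written sketch of the hierarchy just described. It is not the code FreeCC generates: the real classes surely have more members and different constructors, and the children() accessor, the image field, the add method, and the hand-built tree for 2+2 are all assumptions made for illustration. Only the class names and the extends/implements relationships come from the description above. Everything is nested inside one wrapper class so that it does not clash with the generated files.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NodeHierarchySketch {
    // Simplified stand-in for the generated Node interface.
    interface Node {
        List<Node> children(); // hypothetical accessor, for illustration only
    }

    // Stand-in for BaseNode: an inner tree node that holds child nodes.
    static class BaseNode implements Node {
        private final List<Node> children = new ArrayList<>();
        void add(Node child) { children.add(child); }
        public List<Node> children() { return children; }
    }

    // Stand-in for Token: a leaf node that carries the matched text.
    static class Token implements Node {
        final String image; // hypothetical field name
        Token(String image) { this.image = image; }
        public List<Node> children() { return Collections.emptyList(); }
        @Override public String toString() {
            return getClass().getSimpleName() + "(" + image + ")";
        }
    }

    // The three classes named after elements of the grammar.
    static class NUMBER extends Token { NUMBER(String image) { super(image); } }
    static class PLUS extends Token { PLUS() { super("+"); } }
    static class AdditiveExpression extends BaseNode { }

    // Build by hand a tree of the shape one might expect for the input "2+2".
    static AdditiveExpression twoPlusTwo() {
        AdditiveExpression expr = new AdditiveExpression();
        expr.add(new NUMBER("2"));
        expr.add(new PLUS());
        expr.add(new NUMBER("2"));
        return expr;
    }

    // Render a node and its descendants as a single line of text.
    static String describe(Node node) {
        if (node instanceof Token) {
            return node.toString();
        }
        List<String> parts = new ArrayList<>();
        for (Node child : node.children()) {
            parts.add(describe(child));
        }
        return node.getClass().getSimpleName() + parts;
    }

    public static void main(String[] args) {
        // prints AdditiveExpression[NUMBER(2), PLUS(+), NUMBER(2)]
        System.out.println(describe(twoPlusTwo()));
    }
}
```

Because every class here implements Node, code that walks the tree can treat tokens and productions uniformly, which is the property the generated classes share as well.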