Kouprey is a parser combinator library for JavaScript. It can be used to
create and run parsers in any JavaScript environment. It requires no
functionality beyond that specified in ECMAScript.
1.2 Why "Kouprey"?
A "kouprey" is a wild ox native to Cambodia. Think "yacc" and "bison".
1.3 How does Kouprey work?
Kouprey takes a grammar, specified in JavaScript, and creates a Parsing
Expression Grammar (PEG) parser. The created parser is then used,
presumably, to parse whatever it is you wanted to parse.
Parsers created with Kouprey are reusable. That is, you specify the grammar
once, create the parser once, and then parse as many pieces of input as
desired.
1.4 What do I need to know to use Kouprey?
Obviously, you need to know JavaScript. It is also assumed that you have a
basic understanding of parser theory. The grammars passed to Kouprey are
expressed in something very much like Extended Backus-Naur Form (EBNF), so
understanding how that works will go a long way towards using Kouprey
effectively.
Tutorials and explanations on JavaScript/ECMAScript, parser theory, PEGs,
and grammar definitions are beyond the scope of this documentation.
1.5 Is Kouprey usable for foo?
Probably. Kouprey is powerful enough that it has been used to parse complete
programming languages. An example parser for the Component Pascal language
is included in the distribution. Additionally, Kouprey is an underlying
technology in Esel, Jenner, and Pixaxe.
This question could almost be restated as "are parsing expression grammars
usable for foo?" A full discussion of PEGs is beyond the scope of this
document, but, in short, yes. PEGs are slightly more restrictive than
Context-Free Grammars (CFGs) in one respect: PEGs cannot handle left-recursive
grammars, where a cycle of productions does not consume input. The good news,
though, is that left-recursive rules can generally be rewritten using
repetition or right recursion, so, in practice, a PEG can parse most of what
one would write a CFG for.
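To make the rewriting concrete, here is a minimal sketch in plain JavaScript. It does not use Kouprey, and the function names parseNumber and parseSum are invented for illustration. The left-recursive rule Sum <- Sum "-" Number / Number would recurse forever in a naive top-down parser, so it is restated as a Number followed by zero or more ("-" Number) pairs. Folding to the left inside the loop keeps subtraction left-associative.

```javascript
// Left-recursive form (a naive top-down parser would loop forever on it):
//     Sum <- Sum "-" Number / Number
// Equivalent repetition form used below:
//     Sum <- Number ("-" Number)*

// Parse an unsigned integer at pos; return { value, pos } or null.
function parseNumber(input, pos) {
    var m = /^[0-9]+/.exec(input.slice(pos));
    return m ? { value: parseInt(m[0], 10), pos: pos + m[0].length } : null;
}

// Parse Number ("-" Number)*, folding to the left as we go.
function parseSum(input, pos) {
    var left = parseNumber(input, pos);
    if (!left) return null;
    while (input.charAt(left.pos) === "-") {
        var right = parseNumber(input, left.pos + 1);
        if (!right) break;            // leave a dangling "-" unconsumed
        left = { value: left.value - right.value, pos: right.pos };
    }
    return left;
}

var result = parseSum("10-3-2", 0);
console.log(result.value);  // 5, that is (10 - 3) - 2
```

A right-recursive rewrite would also terminate, but it would group the input as 10 - (3 - 2); the loop form avoids that pitfall.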
PEGs also have several advantages over CFGs:
With memoization (packrat parsing), PEGs parse in linear time, at the cost of memory proportional to the input. General CFG parsing takes polynomial (up to cubic) time.
Some grammars that introduce ambiguity in CFGs are handled without
ambiguity in PEGs.
PEGs generally do not require a tokenization step.
PEGs are arguably conceptually simpler.
2. A Quick Example
The fastest way to learn, well, anything, is to start with an example. Below
is a grammar specified in something like EBNF for simple four-function
arithmetic, with operator precedence:
var Grammar = {};
with (com.deadpixi.kouprey) {
    with (Grammar) {
        Grammar.Constant = Forward("Constant");
        Grammar.Factor = Forward("Factor");
        Grammar.Term = Forward("Term");
        Grammar.Expr = [Any([Term, Or("+", "-")]), Term];
        Grammar.Term = [Any([Factor, Or("*", "/")]), Factor];
        Grammar.Factor = Or(Constant, ["(", Expr, ")"]);
        Grammar.Constant = /^\-?(0|[1-9][0-9]*)(\.[0-9]+)?/;
    }
}
var parser = new com.deadpixi.kouprey.Parser(Grammar, "Expr");
Note that this grammar does not account for whitespace, though such support
would be trivial to add.
Let's look at the grammar in a bit more detail:
var Grammar = {};
To create a Kouprey parser, one passes a grammar to the constructor
function. Here we simply create an object to store the grammar.
with (com.deadpixi.kouprey) {
with (Grammar) {
The JavaScript with statement is much maligned and misunderstood. The
with statement places an object at the front of the variable resolution chain.
Any variable lookups start by first checking for a property of that object
by that name, before proceeding along the normal resolution chain (function
scope and then global scope).
These two lines ensure that all of the various Kouprey parser combinators are
available. While it would certainly be possible to write
com.deadpixi.kouprey.Or every time you wanted to create an alternation,
writing Or is much easier.
We're also placing the grammar object we just created into the resolution
chain. This allows us to refer to already-created rules by name. Once again,
we could simply type Grammar.Name every time, but this makes things
much easier.
In other words, though these two with statements are not required, they are
used by convention to make the grammar specification much prettier.
The three lines that call Forward are the grammar's forward declarations. No rule
can be referred to in the specification of another rule unless it has already
been declared. By declaring these rules here, without fully specifying them,
we can refer to them in other rules and provide a full specification later.
This also lets us create rules which are recursive, referring to themselves.
There is no practical penalty in performance for forward declarations.
Grammar.Expr = [Any([Term, Or("+", "-")]), Term];
This line is where the really interesting stuff starts. This line declares a
production called Expr. It then uses four Kouprey features to define
the production:
An array. Arrays are the basic grouping construct in Kouprey grammars.
Rules contained within an array must exist in sequence within the input stream.
Combinators. In this case, the two combinators are Any and Or.
The Any combinator takes a rule and states that it can repeat zero or more
times. The Or combinator takes a collection of rules as arguments
and tries each in order, returning the result of the first one that matches.
Rules. The name Term refers to the Term rule, defined later on in the
grammar. Note that, because it hasn't been fully specified yet, it had to be
placed in a forward declaration at the beginning of the grammar.
Terminals. The two strings "+" and "-" are terminal symbols. They match
themselves literally.
The remaining lines, which define Term, Factor, and Constant, continue the
grammar definition. The final line shows that regular expressions can also be
used as terminals.
var parser = new com.deadpixi.kouprey.Parser(Grammar, "Expr");
This line actually creates the parser. It uses the
com.deadpixi.kouprey.Parser constructor function, passing it the
grammar object we created and in which we stored our productions. The second
argument specifies the root production of the grammar, the production that the
grammar as a whole must try and satisfy.
This parser could then be used like this:
var output = parser.parse("1+2");
This grammar definition is lacking in a few areas that would make it truly
useful. For one, it does not handle whitespace between tokens. For another,
the parser created using this grammar is just a recognizer: it will simply
determine whether a given input is accepted by the grammar or not. Later on
in the documentation we will discover how to make the parser handle whitespace,
and how to make the parser return an abstract syntax tree rather than a
simple parsed string (or null if the input was not accepted).
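For readers who want to experiment without the library at hand, the following is a minimal, self-contained sketch of how a grammar shaped like the one above could be interpreted. It is not Kouprey's implementation: the match function, the way Forward receives the grammar object, and the recognize helper are all assumptions made for illustration. It writes G.Term instead of relying on with, so that it also runs in strict mode, and it only reports acceptance rather than returning the parsed string.

```javascript
// Hypothetical mini-engine, NOT Kouprey's actual code.
// match returns the offset after the match, or -1 on failure.
function match(rule, input, pos, grammar) {
    if (typeof rule === "string") {
        return input.slice(pos, pos + rule.length) === rule ? pos + rule.length : -1;
    }
    if (rule instanceof RegExp) {
        var m = rule.exec(input.slice(pos));
        return m ? pos + m[0].length : -1;
    }
    if (Array.isArray(rule)) {              // sequence: every element, in order
        for (var i = 0; i < rule.length; i++) {
            pos = match(rule[i], input, pos, grammar);
            if (pos < 0) return -1;
        }
        return pos;
    }
    return rule(input, pos, grammar);       // combinator
}

function Or() {
    var rules = Array.prototype.slice.call(arguments);
    return function (input, pos, grammar) {
        for (var i = 0; i < rules.length; i++) {
            var end = match(rules[i], input, pos, grammar);
            if (end >= 0) return end;       // first alternative to match wins
        }
        return -1;
    };
}

function Any(rule) {
    return function (input, pos, grammar) {
        for (;;) {
            var end = match(rule, input, pos, grammar);
            if (end <= pos) return pos;     // failure or no progress: stop
            pos = end;
        }
    };
}

// Look the rule up by name at parse time, after it has been fully defined.
function Forward(name) {
    return function (input, pos, grammar) {
        return match(grammar[name], input, pos, grammar);
    };
}

var G = {};
G.Constant = Forward("Constant");
G.Factor = Forward("Factor");
G.Term = Forward("Term");
G.Expr = [Any([G.Term, Or("+", "-")]), G.Term];
G.Term = [Any([G.Factor, Or("*", "/")]), G.Factor];
G.Factor = Or(G.Constant, ["(", G.Expr, ")"]);
G.Constant = /^\-?(0|[1-9][0-9]*)(\.[0-9]+)?/;

// Accept only if the root rule consumes the whole input.
function recognize(grammar, root, input) {
    return match(grammar[root], input, 0, grammar) === input.length;
}

console.log(recognize(G, "Expr", "1+2"));       // true
console.log(recognize(G, "Expr", "(1+2)*3"));   // true
console.log(recognize(G, "Expr", "1+"));        // false
```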
3. Grammar Definitions
Grammars in Kouprey are just JavaScript objects containing properties. Each
property is a production in the grammar. Each production is specified as
a JavaScript expression. Four different kinds of values are used in
Kouprey grammar definitions:
Strings. A string in the grammar definition matches that string exactly at
the current point in the input stream.
Regular expressions. A regular expression in the grammar definition matches
that expression at the current point in the input stream. Note that
regular expressions must be left-anchored (that is, their first operation
must be "^"). Otherwise, there are no restrictions on the functionality of
regular expressions.
Functions. Functions in this case are parser combinators provided by Kouprey.
They provide constructs like repeated rules, alternations, and assertions.
Arrays. An array groups the rules it contains and enforces that they appear
in the order specified in the input stream.
Strings and regular expressions are the terminals of the grammar, while
functions and arrays are nonterminals. Nonterminals may be nested to
arbitrary depth and complexity.
3.1 Terminals
In Kouprey, strings and regular expressions represent terminals in the grammar.
They match themselves in the input stream.
The simplest possible grammar would be a single terminal. This grammar
would only match the string specified. For example, this is a well-formed,
but rather useless grammar:
var Rules = {
    Root: "Hello, World"
};
This grammar would match literally the string "Hello, World" anywhere in the
input stream.
Regular expressions are also used to represent terminals. The only restriction
is that all regular expressions must be left anchored (i.e. they must begin
with a "^"). Otherwise, there are no restrictions on what may be done with
regular expressions.
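One way to see why the anchor matters: suppose, as an assumption about the implementation, that a regular-expression terminal is tested against whatever input remains at the current offset. The helper matchRegex below is hypothetical and exists only to illustrate that idea.

```javascript
// Hypothetical helper: test a regular expression against the input that
// remains at offset pos. Returns the matched text, or null.
function matchRegex(regex, input, pos) {
    var m = regex.exec(input.slice(pos));
    return m ? m[0] : null;
}

var input = "abc123";

// Anchored: digits must begin exactly at the current offset.
console.log(matchRegex(/^[0-9]+/, input, 0));   // null
console.log(matchRegex(/^[0-9]+/, input, 3));   // "123"

// Unanchored: the regular expression silently skips over "abc" and still
// reports a match, even though no digits start at offset 0.
console.log(matchRegex(/[0-9]+/, input, 0));    // "123"
```

Under that assumption, an unanchored expression would let a terminal "match" text that is not actually at the current position, which is why the leading "^" is required.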
3.1.1 Unicode Terminals
Kouprey fully supports Unicode, insofar as ECMAScript does. Unicode characters
may be placed into strings and regular expressions, used as variable names,
and so on.
ECMAScript, however, does not provide a standard way to match useful ranges
of Unicode characters, other than by creating character classes in regular
expressions.
Kouprey therefore provides two helper functions that produce regular
expressions that match useful Unicode character ranges. These functions are
named u and U.
The lowercase u function takes a single argument, a string, and
returns a single regular expression that will match a character in the
Unicode category or block specified by the argument. The uppercase U function
returns the inverse of what would have been returned by a call with the same
argument to the lowercase u function.
The argument to these two functions should be either a Unicode category,
or a block name.
The following categories are recognized:
C - Other
L - Letter
M - Mark
N - Number
P - Punctuation
S - Symbol
Z - Separator
If the argument is a block name, it must be one of the block names defined by
the Unicode standard, prefixed by the letters "In"
(e.g. "InHalfWidthAndFullWidthForms").
Block names are not case sensitive. Spaces, underscores, hyphens, and carets
can be inserted anywhere; these will be stripped out before use.
As a quick example, this sample grammar rule would match any sequence of
Unicode letters and numbers that begins with a letter:
[u("L"), Any(Or(u("L"), u("N")))]
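In engines that implement ECMAScript 2018 or later, Unicode property escapes offer a built-in way to express the same one-letter categories. The sketch below is not Kouprey's u and U: the names uSketch and USketch are invented here, and the sketch handles only the categories listed above, not block names.

```javascript
// The one-letter general categories listed in this section.
var CATEGORIES = ["C", "L", "M", "N", "P", "S", "Z"];

function checkCategory(category) {
    if (CATEGORIES.indexOf(category) < 0) {
        throw new Error("Unsupported category: " + category);
    }
}

// Left-anchored regular expression matching one character in the category.
function uSketch(category) {
    checkCategory(category);
    return new RegExp("^\\p{" + category + "}", "u");
}

// The inverse: one character that is NOT in the category.
function USketch(category) {
    checkCategory(category);
    return new RegExp("^\\P{" + category + "}", "u");
}

console.log(uSketch("L").test("\u00e9"));   // true: e with acute accent is a letter
console.log(uSketch("L").test("7"));        // false
console.log(USketch("L").test("7"));        // true
console.log(uSketch("N").test("\u0663"));   // true: Arabic-Indic digit three
```

Both functions keep the leading "^", so the resulting expressions could be dropped into a grammar wherever a left-anchored terminal is expected.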
3.2 Combinators
The real power of Kouprey comes from using parser combinators. Parser
combinators are functions that take a rule (or collection of rules) and return
a new function that is then used to test input.
The following combinators are provided in Kouprey. They are in the
com.deadpixi.kouprey object, so by convention this object is placed
in the variable resolution chain using a with statement.
3.2.1 And (rule)
The And combinator is a type of assertion. It consumes no input.
Instead, it parses according to the rule passed into it and returns whether or
not that rule is true at the current offset into input, but does not consume
any input in the course of parsing.
And is also known as a lookahead assertion.
For example, this rule would match the word "Hello" only if it is followed by
"World", but the "World" portion would not be included in the match:
["Hello", And("World")]
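The "consumes no input" behavior is easiest to see by looking at offsets. The following sketch runs the rule above on a hypothetical mini-engine, not on Kouprey itself; the match function and this definition of And are assumptions made for illustration.

```javascript
// Hypothetical mini-engine, NOT Kouprey's code: strings match literally,
// arrays match in sequence, functions are combinators.
// Returns the offset after the match, or -1 on failure.
function match(rule, input, pos) {
    if (typeof rule === "string") {
        return input.slice(pos, pos + rule.length) === rule ? pos + rule.length : -1;
    }
    if (Array.isArray(rule)) {
        for (var i = 0; i < rule.length && pos >= 0; i++) {
            pos = match(rule[i], input, pos);
        }
        return pos;
    }
    return rule(input, pos);
}

// Lookahead: succeed at the current offset if the rule matches,
// without advancing past what the rule matched.
function And(rule) {
    return function (input, pos) {
        return match(rule, input, pos) >= 0 ? pos : -1;
    };
}

var rule = ["Hello", And("World")];
console.log(match(rule, "HelloWorld", 0));   // 5: only "Hello" is consumed
console.log(match(rule, "HelloThere", 0));   // -1: the lookahead fails
```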
3.2.2 Any (rule)
The Any combinator ensures that the rule passed to it is matched
any number of times, including zero times, in a row. The combinator matches
the longest possible sequence of symbols, and all of them are included in the
returned match.
For example, this rule would match any number of decimal digits, including no
digits:
Any(/^[0-9]/);
3.2.3 Balanced (beginning, end, esc)
The Balanced combinator takes two or three string arguments. It consumes
the longest possible sequence that is surrounded by a balanced number of
beginning and end symbols.
In other words, a call like this:
Balanced("(*", "*)")
would match a Pascal-style comment, and would also match comments nested within
comments.
If the third argument, esc, is provided, then beginning and ending
sequences may be escaped by that sequence. In other words, any beginning or
ending sequence prefixed by the esc sequence is not counted for
the purposes of balanced delimiters.
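The counting involved can be sketched as follows. This is one plausible reading of the description above, not Kouprey's implementation; the name balancedSketch is invented, and the sketch assumes that the beginning and ending sequences differ.

```javascript
// Hypothetical sketch of a balanced-delimiter matcher, NOT Kouprey's code.
// Returns the offset just past the ending sequence that balances the
// beginning sequence found at pos, or -1 if there is none.
// Assumes begin and end are different strings.
function balancedSketch(begin, end, esc) {
    return function (input, pos) {
        if (!input.startsWith(begin, pos)) return -1;
        var depth = 0;
        while (pos < input.length) {
            if (esc && input.startsWith(esc, pos)) {
                pos += esc.length;
                // An escaped delimiter is skipped without being counted.
                if (input.startsWith(begin, pos)) pos += begin.length;
                else if (input.startsWith(end, pos)) pos += end.length;
            } else if (input.startsWith(begin, pos)) {
                depth++;
                pos += begin.length;
            } else if (input.startsWith(end, pos)) {
                depth--;
                pos += end.length;
                if (depth === 0) return pos;
            } else {
                pos++;
            }
        }
        return -1;   // ran out of input while still nested
    };
}

var comment = balancedSketch("(*", "*)");
console.log(comment("(* a (* b *) c *) rest", 0));   // 17

var braces = balancedSketch("{", "}", "\\");
console.log(braces("{a\\}b}", 0));                   // 6: the escaped "}" is not counted
```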
3.2.4 End ()
The End combinator matches the end of input. Therefore, the entire
pattern up until the invocation of this combinator must be the last thing in
the input stream.
For example, this rule would match the word "Bye", if and only if it is the
last thing before the end of input:
["Bye", End()]
3.2.5 Forward (name)
The Forward function isn't really a combinator. It doesn't appear in
grammar production definitions. Instead, it is used at the beginning of a
grammar to declare rules by name, in case they must be referred to before
being defined.
The Forward function returns a function that looks up the rule at parse time.
Therefore, the output of the Forward function must be assigned to a rule of
the same name that it is declaring. For example, to declare a rule named
"Foo", using the rules dictionary "Rules":
Rules.Foo = Forward("Foo");
3.2.6 Not (rule)
The Not combinator is the logical negation of the And combinator.
It consumes no input, but it fails to match if the rule passed to it
does match at the current offset.
For example, this rule will match the word "Hello" if and only if it is not
immediately followed by the word "Sailor". Note that "Sailor" is not consumed
and is not part of the match:
["Hello", Not("Sailor")]
A common application of the Not combinator is making sure variable names are
not also keywords. For example:
Though this looks odd, it does work. This says "match any sequence of letters
and numbers starting with a letter, so long as the first part of that is not
a keyword, unless that keyword is immediately followed by another character."
The reason why that second Not is in there is to ensure that we don't declare
something like "functions" to be a keyword, since while it starts with
"function" there are extra characters at the end.
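A rule of the shape described here can be tried out with the following sketch. It is only one plausible form of such a rule, and it runs on a hypothetical mini-engine rather than on Kouprey itself. The keyword list and the names Keyword, IdentChar, and Identifier are assumptions.

```javascript
// Hypothetical mini-engine, NOT Kouprey's code.
// Returns the offset after the match, or -1 on failure.
function match(rule, input, pos) {
    if (typeof rule === "string") {
        return input.slice(pos, pos + rule.length) === rule ? pos + rule.length : -1;
    }
    if (rule instanceof RegExp) {
        var m = rule.exec(input.slice(pos));
        return m ? pos + m[0].length : -1;
    }
    if (Array.isArray(rule)) {
        for (var i = 0; i < rule.length && pos >= 0; i++) {
            pos = match(rule[i], input, pos);
        }
        return pos;
    }
    return rule(input, pos);
}

function Or() {
    var rules = Array.prototype.slice.call(arguments);
    return function (input, pos) {
        for (var i = 0; i < rules.length; i++) {
            var end = match(rules[i], input, pos);
            if (end >= 0) return end;
        }
        return -1;
    };
}

// Negative lookahead: succeed, consuming nothing, only if the rule fails.
function Not(rule) {
    return function (input, pos) {
        return match(rule, input, pos) >= 0 ? -1 : pos;
    };
}

var Keyword = Or("function", "var", "return");
var IdentChar = /^[A-Za-z0-9]/;

// Reject a keyword, unless the keyword is immediately followed by
// another identifier character (the inner Not).
var Identifier = [Not([Keyword, Not(IdentChar)]), /^[A-Za-z][A-Za-z0-9]*/];

console.log(match(Identifier, "functions", 0));   // 9: a valid identifier
console.log(match(Identifier, "function", 0));    // -1: a bare keyword
console.log(match(Identifier, "variable", 0));    // 8: only starts with "var"
```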
3.2.7 Optional (rule)
The Optional combinator states that the rule passed into it may match
either zero or one times, but no more.
For example, this rule will match "Hello, World" or "Hello World":
["Hello", Optional(","), " World"]
3.2.8 Or (rule1, rule2, ..., ruleN)
The Or combinator takes an arbitrary number of rules, and tries them
each in order. The first one to match is returned.
The Or combinator is probably the most important combinator, as it is used to
indicate precedence, not just of rules, but of constructs in the parsed input.
Precedence is indicated by ordering of rules passed to the combinator.
It is important to remember that the first rule to match is returned.
Therefore, one must be careful to not put more general rules in front of more
specific rules. For example, this would be a bad idea:
// Don't do this.
Or("func", "function");
If this rule were used on an input sequence like this:
function (foo) { return foo + 1 };
Parsing would fail. That is because the "func" rule would match, leaving the
input remaining as:
tion (foo) { return foo + 1 };
Depending on how the rest of the grammar were written, this would probably be
the Wrong Thing.
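The effect of ordering can be reproduced with a small sketch. It uses a hypothetical mini-engine, not Kouprey itself, and the rules named bad and good are invented for the demonstration. The only difference between them is the order of the two alternatives.

```javascript
// Hypothetical mini-engine, NOT Kouprey's code.
// Returns the offset after the match, or -1 on failure.
function match(rule, input, pos) {
    if (typeof rule === "string") {
        return input.slice(pos, pos + rule.length) === rule ? pos + rule.length : -1;
    }
    if (Array.isArray(rule)) {
        for (var i = 0; i < rule.length && pos >= 0; i++) {
            pos = match(rule[i], input, pos);
        }
        return pos;
    }
    return rule(input, pos);
}

// Ordered choice: the first alternative that matches wins, and the
// remaining alternatives are never tried at this position.
function Or() {
    var rules = Array.prototype.slice.call(arguments);
    return function (input, pos) {
        for (var i = 0; i < rules.length; i++) {
            var end = match(rules[i], input, pos);
            if (end >= 0) return end;
        }
        return -1;
    };
}

var bad = [Or("func", "function"), " (foo)"];
var good = [Or("function", "func"), " (foo)"];

console.log(match(bad, "function (foo)", 0));    // -1: "func" wins, then "tion" is in the way
console.log(match(good, "function (foo)", 0));   // 14
console.log(match(good, "func (foo)", 0));       // 10: the shorter form still works
```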
As a good, if simple, example, this grammar would match a traditional greeting in
both English and French (or, in theory, a combination of the two):
Note the optional whitespace being consumed by the regular expression.
3.2.9 Some (rule)
The operation of the Some combinator is identical to that of the
Any combinator, except that it requires at least one occurrence of the
rule at the current position in input.
In other words:
// This:
Some("A")
// Is exactly equivalent to this:
["A", Any("A")]
3.2.10 Start ()
This combinator matches the start of the input. That is, it anchors the
grammar to the beginning of the input stream.
For example, this grammar will match the word "Hello" if and only if it is the
first thing in the input:
[Start(), "Hello"]
3.3 Arrays
Arrays are the sequencing construct in Kouprey grammars. Rules contained in
arrays must match in the specified order. Moreover, all of the rules in an
array must match, or the whole rule fails.
For example, this rule matches an "A" followed by a "B":
["A", "B"]
Any rule can be placed in a sequence, including combinators. Therefore, complex
grammars like the arithmetic example above can be constructed.
4. Creating a Parser
The Kouprey parser constructor is a normal JavaScript constructor function,
com.deadpixi.kouprey.Parser. It is invoked like this:
new com.deadpixi.kouprey.Parser(grammar, root, [debug])
It takes two mandatory arguments:
grammar - A JavaScript object, containing a Kouprey grammar declared
as described above.
root - A string. The rule named by this string is the root rule, the
rule that the grammar as a whole must attempt to satisfy.
It takes one optional argument:
debug - A Boolean, defaulting to false. If it is set to true, the
parser stores debugging information during parsing. This information can be
retrieved after parsing by calling the getDebugTrace method of the
parser object.
The com.deadpixi.kouprey.Parser constructor returns a parser object.
Apart from getDebugTrace, this object has only one useful method:
parse(input)
The input in this case is a string. Calling the parse function with a string
will invoke the parser using that string as its input. What it returns depends
on how the grammar is declared: see "A Parser's Output" below.
The parser object is reusable: it can be invoked any number of times with
different input.
5. A Parser's Output
Depending on how a grammar is declared, a parser returns one of three things.
5.1 Failed Parsing
If the input fails to match a parser's grammar, the parser returns null.
5.2 Recognizers
If a grammar has no components marked as "interesting" (what "interesting"
means will be explained shortly), or if its root rule is not marked
"interesting", then any parser created using that grammar is more accurately
called a "recognizer". A recognizer simply parses its input and returns
the matching input as a string (or null if there is no match).
5.3 Abstract Syntax Trees
Kouprey provides a function, com.deadpixi.kouprey.$, that is used to
mark certain portions of the grammar as interesting.
The syntax for this function is:
$(rule, [name])
Here, the rule is any grammar rule. The $ function can appear anywhere in a
grammar.
When a rule is marked as interesting, it returns a Match object instead of
the matching input when a match occurs. The Match
object can be given a name, specified by the optional second parameter of the
$ function. If no name is provided, the rule is converted to a string using its
toString method and the result of that conversion is used as the name.
5.3.1 Match Objects
A Match object is returned by a rule that has been marked as interesting, if
that rule matches (if it doesn't match, it returns null).
A Match object has several interesting properties:
type - The type of the match. This is the string passed as the
second argument to the $ function, or the string representation of the
rule if no second argument was passed to the $ function.
matchlength - How many characters from the input matched the rule.
offset - The offset into the input that the match starts. The first
character of the input is at offset zero.
children - An array. The contents of this array are all Match
objects returned by the rules contained underneath this rule, but only if
those rules were marked as interesting themselves. If no rules underneath
the current one were marked interesting, the search is continued to their
children, and so on. This is explained further below.
value - If a rule is a terminal, this is the value of the terminal.
Match objects are only returned from rules marked as interesting. If a
nonterminal is marked as interesting, it only has interesting children or no
children at all.
That is, if none of the rules underneath an interesting rule are marked as
interesting themselves, they are not included in the children array. However,
any interesting grandchildren of the current rule will then be promoted to
the "children" array of the current match. If none of its grandchildren are
interesting, the great-grandchildren are inspected, and so on. If no interesting
nonterminal rules are discovered, any terminal discovered becomes the
value of the "value" property of the match object.
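The promotion of interesting descendants can be pictured with a small sketch. Nothing here is Kouprey's actual code or data layout: the raw node shape ({ type, interesting, offset, text, kids }) and the functions collectInteresting and toMatch are hypothetical, and setting value to the matched text is a simplification of the rule described above.

```javascript
// Gather Match objects for a list of raw nodes. An uninteresting node is
// skipped, but its interesting descendants are promoted in its place.
function collectInteresting(nodes) {
    var found = [];
    nodes.forEach(function (node) {
        if (node.interesting) {
            found.push(toMatch(node));
        } else {
            found = found.concat(collectInteresting(node.kids));
        }
    });
    return found;
}

// Build a Match-like object for an interesting raw node.
function toMatch(node) {
    var match = {
        type: node.type,
        offset: node.offset,
        matchlength: node.text.length,
        children: collectInteresting(node.kids)
    };
    if (match.children.length === 0) {
        match.value = node.text;   // no interesting descendants: keep the text
    }
    return match;
}

// Raw result for "hello sailor": Root is interesting, the sequence that
// wraps its contents is not, and Greeting and Noun are.
var raw = {
    type: "Root", interesting: true, offset: 0, text: "hello sailor",
    kids: [{
        type: "sequence", interesting: false, offset: 0, text: "hello sailor",
        kids: [
            { type: "Greeting", interesting: true, offset: 0, text: "hello", kids: [] },
            { type: "whitespace", interesting: false, offset: 5, text: " ", kids: [] },
            { type: "Noun", interesting: true, offset: 6, text: "sailor", kids: [] }
        ]
    }]
};

var tree = toMatch(raw);
console.log(tree.children.map(function (c) { return c.type; }));   // [ 'Greeting', 'Noun' ]
console.log(tree.children[1].value);                               // "sailor"
```

Greeting and Noun are grandchildren of Root in the raw structure, yet they end up directly in Root's children array because the node between them is not interesting.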
Note that, due to this mechanism, a root rule must always be marked as
interesting, if any other portion of the grammar is marked as interesting, or
else the returned value will not be a Match object.
Notice how now the Noun and Greeting rules are marked interesting in and of
themselves. Let's run the parser using this grammar on the same input. Here's
the resulting Match object:
Notice the difference: Even though the "Greeting" rule wasn't marked as
interesting in the "Root" rule, it was still included. That's because it
was marked interesting in and of itself.
This second form means that all "Greeting" rules are interesting, regardless
of where they occur.
This Match object (which is also an Abstract Syntax Tree), could be passed to
a code generator or other useful piece of code.
Kouprey comes with several examples to illustrate how to use Match objects
effectively.
6. Conventions
6.1 Whitespace
Kouprey-generated parsers do not automatically insert or assume whitespace
anywhere. Whether or not this is considered a feature or a bug is up to you.
For the record, this behavior was chosen because Kouprey was written to parse
arbitrary strings of symbols, not necessarily text. In non-text input streams,
the concept of "whitespace" becomes meaningless.
By convention, since whitespace tends to show up fairly often in grammars,
a rule consuming whitespace (whatever that means in your grammar's context)
is assigned to rules named "_" for optional whitespace and "$_" for mandatory
whitespace.
For example, this is a common sight at the top of Kouprey grammars:
Rules._ = /^\s*/;
Rules.$_ = /^\s+/;
6.2 The "with" Convention
As was discussed in the introduction, all of the functionality of Kouprey is
placed in the "com.deadpixi.kouprey" object. While you are more than welcome
to type this out at every invocation of a combinator in a grammar definition,
it is generally much better to use the JavaScript with statement.
The same goes for the rules dictionary. Sure, you could write out the
rule dictionary's name every time you needed to refer to it, but since most
grammar productions are going to refer to other rules, this can become
tedious. Therefore, the rules dictionary should also be placed in the
variable resolution chain using a with statement.
Compare:
// Using the "with" convention:
var Rules = {};
with (com.deadpixi.kouprey) {
    with (Rules) {
        Rules.Greeting = Or("hello", "bonjour");
        Rules.Noun = Or("sailor", "marin");
        Rules.Root = [Greeting, Optional(","), /^\s*/, Noun];
    }
}
Versus:
// Without using the "with" convention:
var Rules = {};
Rules.Greeting = com.deadpixi.kouprey.Or("hello", "bonjour");
Rules.Noun = com.deadpixi.kouprey.Or("sailor", "marin");
Rules.Root = [Rules.Greeting, com.deadpixi.kouprey.Optional(","),
              /^\s*/, Rules.Noun];
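The choice is yours. Bear in mind that the with statement is not permitted in
strict mode code, so only the second form works there.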
7. Performance Tips
Kouprey is not especially efficient, but neither is it pathologically
inefficient. The inefficiency primarily comes from three factors:
It's written in an interpreted language, running in a dynamic environment.
It creates a lot of strings.
It invokes a lot of functions.
There's not much that can be done about the first two points, but the third
one can be mitigated somewhat. Every combinator used results in at least
one function invocation, and possibly very many. Therefore, it is advised to
avoid combinators when they are not necessary.
For example, instead of doing something like this:
Any(/^\s/);
It's much (much!) faster to do this:
/^\s*/
The general rule of optimization for Kouprey grammars is: try to match as much
as possible in regular expressions and string terminals.
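To get a feel for the difference, the sketch below counts how many times a hypothetical match function is entered for each form. It is not Kouprey's code, so the absolute numbers say nothing about the real library; the point is only that the combinator form grows with the length of the input while the single regular expression does not.

```javascript
// Hypothetical mini-engine, NOT Kouprey's code, instrumented with a counter.
var calls = 0;

function match(rule, input, pos) {
    calls++;
    if (rule instanceof RegExp) {
        var m = rule.exec(input.slice(pos));
        return m ? pos + m[0].length : -1;
    }
    return rule(input, pos);   // combinator
}

function Any(rule) {
    return function (input, pos) {
        for (;;) {
            var end = match(rule, input, pos);
            if (end <= pos) return pos;   // failure or no progress: stop
            pos = end;
        }
    };
}

// Run a rule from offset 0 and report where it ended and how many
// times match was entered along the way.
function countCalls(rule, input) {
    calls = 0;
    var end = match(rule, input, 0);
    return { end: end, calls: calls };
}

var spaces = " ".repeat(10);

console.log(countCalls(Any(/^\s/), spaces));   // { end: 10, calls: 12 }
console.log(countCalls(/^\s*/, spaces));       // { end: 10, calls: 1 }
```

Both forms consume the same ten characters, but the first one re-enters the engine once per character plus once for the final failed attempt.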
8. Frequently Asked Questions
8.1 Why don't Kouprey grammars work on the iPhone/Other Small Device?
The parsers created by Kouprey involve a lot of recursion - this is not so
unusual; most parsers are highly recursive.
However, small devices like the iPhone limit the call stack of JavaScript
programs. On some devices, the call stack is only large enough for fifty
nested calls!
Even the simplest grammars can overrun the limited stack on these small
devices.
If you have any suggestions on getting Kouprey to work better on small
devices, please contact Rob with information.
9. Contact Information, Copyright, Trademarks, Licensing, and References
9.1 Contact
The author of Kouprey is Rob King. He can be contacted at his home page,
Deadpixi.COM.
9.2 Contributions
Patches for Kouprey may be sent to Rob at his home page,
Deadpixi.COM.
Please note that patches may not be accepted. To be accepted, the patch
author must transfer copyright to Rob King, to ensure continuity
and correctness of licensing.
9.3 Trademark and Copyright
It's silly that this even needs to be said, but if I don't put these here,
I'd have no recourse if someone came along and decided to name their parsing
library "Kouprey":
"Kouprey" is a trademark of Rob King.
The implementation of Kouprey is copyright 2009 Rob King.
9.4 Licensing
The Kouprey implementation is licensed under the terms of the GNU Lesser
General Public License. A copy of the license is available from the Free
Software Foundation.
If you wish to license Kouprey under different terms, please contact the
author at his home page,
Deadpixi.COM for negotiations.