The parser

The parser creates AST nodes that are described in the AST section.

Thanks to the work we did in the tokenizer, the parser operates on a stream of tokens instead of dealing with individual characters. It still defines many helpers to keep complexity down. I'll discuss here the main functions that comprise the parser. Let's start with a high level one, the lambda parser:

function parse_lambda() {
    return {
        type: "lambda",
        vars: delimited("(", ")", ",", parse_varname),
        body: parse_expression()
    };
}

This function will be invoked when the lambda keyword has already been seen and “eaten” from the input, so all it cares for is to parse the argument names; but they're in parentheses and delimited by commas. Rather than placing that code in parse_lambda, I preferred to write a delimited function that takes these arguments: the start token, the end token, the separator, and a function which parses whatever must be between those start/end tokens. In this case, it's parse_varname, which takes care to throw an error if it encounters anything which doesn't look like a variable. The body of the function is an expression, so we get it with parse_expression.

delimited is a bit lower-level:

function delimited(start, stop, separator, parser) {
    var a = [], first = true;
    skip_punc(start);
    while (!input.eof()) {
        if (is_punc(stop)) break;
        if (first) first = false; else skip_punc(separator);
        if (is_punc(stop)) break; // the last separator can be missing
        a.push(parser());
    }
    skip_punc(stop);
    return a;
}

As you can see, it uses more utilities: is_punc and skip_punc. The former will return true if the current token is the given punctuation sign (without “eating” it), while skip_punc will ensure that the current token is that punctuation (throws an error otherwise) and will discard it from the input.

The function that parses the whole program is probably the simplest:

function parse_toplevel() {
    var prog = [];
    while (!input.eof()) {
        prog.push(parse_expression());
        if (!input.eof()) skip_punc(";");
    }
    return { type: "prog", prog: prog };
}

Since we have no statements, we simply call parse_expression() and read expressions until we get to the end of the input. Using skip_punc(";") we demand semicolons between these expressions.

Another simple example: parse_if():

function parse_if() {
    skip_kw("if");
    var cond = parse_expression();
    if (!is_punc("{")) skip_kw("then");
    var then = parse_expression();
    var ret = { type: "if", cond: cond, then: then };
    if (is_kw("else")) {
        input.next();
        ret.else = parse_expression();
    }
    return ret;
}

It skips over the if keyword with skip_kw (and this throws an error if the current token is not the given keyword), reads the condition using parse_expression(). Next, if the consequent branch doesn't start with a { we require the keyword then to be present (I feel like the syntax is too scarce without it). The branches are just expressions, so again we use parse_expression() for them. The else branch is optional so we need to check if the keyword is present before parsing it.

Having many small utilities helps a lot in keeping the code simple. We almost write the parser like we had a high level language dedicated for parsing. All these functions are “mutually recursive”, e.g.: there's a parse_atom() function which is the main dispatcher — based on the current token it calls other functions. One of them is parse_if() (called when the current token is if) and that in turn calls parse_expression(). But parse_expression() calls parse_atom(). The reason why there's no infinite loop is that at each step, one function or another will advance at least one token.

This kind of parser is called a recursive descent parser and it's probably the easiest kind to write manually.

Lower level: parse_atom() and parse_expression()

parse_atom() does the main dispatching job, depending on the current token:

function parse_atom() {
    return maybe_call(function(){
        if (is_punc("(")) {
            input.next();
            var exp = parse_expression();
            skip_punc(")");
            return exp;
        }

        // This is the proper place to implement unary operators.
        // Following is the code for boolean negation, which is present
        // in the final version of lambda.js, but I'm leaving it out
        // here since support for it is not implemented in the interpreter
        // nor compiler, in this tutorial:
        //
        // if (is_op("!")) {
        //     input.next();
        //     return {
        //         type: "not",
        //         body: parse_atom()
        //     };
        // }

        if (is_punc("{")) return parse_prog();
        if (is_kw("if")) return parse_if();
        if (is_kw("true") || is_kw("false")) return parse_bool();
        if (is_kw("lambda") || is_kw("λ")) {
            input.next();
            return parse_lambda();
        }
        var tok = input.next();
        if (tok.type == "var" || tok.type == "num" || tok.type == "str")
            return tok;
        unexpected();
    });
}

If it sees an open paren, then it must be a parenthesized expression — thus, skip over paren, call parse_expression() and expect a closing paren. If it sees some keyword, it calls the appropriate parser function. If it sees a constant or an identifier, it's just returned as is. And if nothing works, unexpected() will throw an error.

When an atomic expression is expected and it sees {, it calls parse_prog to parse a sequence of expressions. That's defined below. It will do some minor optimization at this point — if the prog is empty, then it just returns FALSE. If it has a single expression, it is returned instead of a "prog" node. Otherwise it returns a "prog" node containing the expressions.

// we're going to use the FALSE node in various places,
// so I'm making it a global.
var FALSE = { type: "bool", value: false };

function parse_prog() {
    var prog = delimited("{", "}", ";", parse_expression);
    if (prog.length == 0) return FALSE;
    if (prog.length == 1) return prog[0];
    return { type: "prog", prog: prog };
}

Here's the parse_expression() function. Contrary to parse_atom(), this one will extend an expression as much as possible to the right using maybe_binary(), which is explained below.

function parse_expression() {
    return maybe_call(function(){
        return maybe_binary(parse_atom(), 0);
    });
}

The maybe_* functions

These functions check what follows after an expression in order to decide whether to wrap that expression in another node, or just return it as is.

maybe_call() is very simple. It receives a function that is expected to parse the current expression. If after that expression it sees a ( punctuation token, then it must be a "call" node, which is what parse_call() makes (included below). Notice again how delimited() comes in handy for reading the argument list.

function maybe_call(expr) {
    expr = expr();
    return is_punc("(") ? parse_call(expr) : expr;
}

function parse_call(func) {
    return {
        type: "call",
        func: func,
        args: delimited("(", ")", ",", parse_expression)
    };
}

Operator precedence

maybe_binary(left, my_prec) is used to compose binary expressions like 1 + 2 * 3. The trick to parse them properly is to correctly define the operator precedence, so we'll start with that:

var PRECEDENCE = {
    "=": 1,
    "||": 2,
    "&&": 3,
    "<": 7, ">": 7, "<=": 7, ">=": 7, "==": 7, "!=": 7,
    "+": 10, "-": 10,
    "*": 20, "/": 20, "%": 20,
};

This says that * binds tighter than +, so an expression like 1 + 2 * 3 must be read as (1 + (2 * 3)) instead of ((1 + 2) * 3), which would be the normal left-to-right order in which the parser operates.

The trick is to read an atomic expression (only the 1) and pass it to maybe_binary() (the left argument), along with the current precedence (the my_prec). maybe_binary will look at what follows. If it doesn't see an operator, or if it has a smaller priority, then left is returned as is.

If it's an operator that has a higher precedence than ours, then it wraps left in a new "binary" node, and for the right side it repeats the trick at the new precedence level (*):

function maybe_binary(left, my_prec) {
    var tok = is_op();
    if (tok) {
        var his_prec = PRECEDENCE[tok.value];
        if (his_prec > my_prec) {
            input.next();
            var right = maybe_binary(parse_atom(), his_prec) // (*);
            var binary = {
                type     : tok.value == "=" ? "assign" : "binary",
                operator : tok.value,
                left     : left,
                right    : right
            };
            return maybe_binary(binary, my_prec);
        }
    }
    return left;
}

Note that before returning the binary expression we must also call maybe_binary at the old precedence level (my_prec), in order to wrap the expression in another one, should an operator with a higher precedence follow. If all this is confusing, read the code again and again (perhaps try to execute it mentally on some input expressions) until you get it.

Finally, since my_prec is initially zero, any operator will trigger the building of a "binary" node (or "assign" when the operator is =).

There are a few more functions in the parser, so I'm including the whole parse function below. Click “Show code” to display it (about 150 lines).

Credits

The moment I understood how to write a non-trivial parser occurred while studying Marijn Haverbeke's parse-js library (Common Lisp). The parser above, although for a much simpler language, is modeled after his code.

Welcome! (login)