UglifyJS — why not switching to SpiderMonkey AST
Yesterday I wanted to publish version 2 on NPM and make it official, but I hit this NPM issue. Seems it's now fixed, but I'm holding it off to finish a new feature that I'm working on (import the SpiderMonkey AST), so you still have to use the Git tree to play with it.
I completed the CLI tool with a bunch of new options and wrote the documentation in the README. It's now possible to keep certain comments in the output. Still no API documentation so far, but it'll come shortly (I'll actually have a dedicated section on this website for UglifyJS).
Next, some points about why I'm not switching to the SpiderMonkey AST internally. Tl;dr: I don't quite like it. Also, I'd kinda have to rewrite the mangler and compressor from scratch. That's Esmangle's job. Any case, UglifyJS will support the SpiderMonkey AST by converting it to our internal AST.
New kid in town: Acorn
Fellow hacker Marijn Haverbeke just published the fastest JS-parser-in-JS, named Acorn. The speed is daunting. In my tests it's 15-35% better than UglifyJS2 when locations and trackComments are on. Congrats to Marijn for this masterpiece!
Support for the SpiderMonkey AST
I thought it would be nice to provide some inter-operability with foreign AST structures, and since the Mozilla Parser API tree format became quite popular with the development of Esprima (and will be even more so when people start noticing how good Acorn is), I started working on an “AST importer”. It should be fully functional in a few days, meaning, UglifyJS will get a new command-line option to use Acorn or Esprima to do the parsing, and there will be an API that allows one to pass a “standard” AST and apply all the mangling/compression options on it.
First I thought that Acorn being so fast, it might worth to use it for parsing and transform the SpiderMonkey AST into UglifyJS2 AST; that doesn't seem to be the case though. Here's timing for jQuery:
parse_acorn: 0.150
parse_uglify: 0.187
transform AST: 0.083
So if I'd do the parsing via Acorn and transform the SpiderMonkey AST into the UglifyJS2 AST, It would be overally slower by about 50ms. However, I expect this feature to still be useful to folks who generate a SpiderMonkey AST somehow (for example CoffeeScriptRedux) and want to use UglifyJS's mighty compression powers.
Why not using the SpiderMonkey AST internally?
Some people suggested this, and I feel like I owe an answer. Yesterday's hack was a small step in this direction (well, not to using the SM AST internally, but at least being able to grok one). It's the first time I study the SpiderMonkey AST format and I already found a few things I don't like about it; but first and most important, in order to use it I'd have to rewrite the mangler and compressor from scratch (and let's say I'd switch to escodegen for code output). I hardly finished the first big rewrite of UglifyJS and I'm not in a mood to start another one.
No inheritance1
This is the biggest issue. UglifyJS heavily relies on a certain hierarchy among AST nodes. For example the AST_Return and AST_Throw nodes inherit from AST_Exit, which in turns inherits from AST_Jump. In certain parts of the code we deal with return and throw in the same way,
therefore we only need to check if node instanceof AST_Exit
. There are lots of similar situations, for
example AST_Assign is an instanceof
AST_Binary, and this helps here and there.
More important than checking for node features, in UglifyJS nodes can have methods and they can be overridden by
subclasses. Common OOP technique :p. That wouldn't work with flat objects, however. For example my code generator
inserts a "print"
method in every node's prototype, so to get the source code of any node I'd write node.print(stream)
. With a flat inheritance tree, that would have to be more like print(node, stream)
,
where the "print" function would be a big switch
block checking for the node type first. That's how
escodegen looks like — a few gigantic switches. My code generator is something like half the
size.
So any case: making my code work with flat objects would practically mean rewriting all of it. However, assuming Acorn or Esprima implement an inheritance structure, then refactoring UglifyJS to use the SpiderMonkey AST internally would become feasible. Hard work, but doable.
Glitches (IMO) in the SpiderMonkey AST format
Except for the inheritance part, the SpiderMonkey AST doesn't differ that much from UglifyJS2's AST structure. In both, a node is an object, it contains source code location info (and perhaps comments) and a few child nodes. However...
Abusing BlockStatement
I made this mistake in UglifyJS2 too, but I refactored it before the mess
spread out. In the Mozilla AST, a BlockStatement is used as the block and finalizer of try
node,
and as the body of catch
and function
nodes (but inconsistently, not for switch
nodes).
I think that's wrong; while there is a visual resemblance between a block statement and a function body, both
semantically and syntactically they are different things. A block statement is, well, a statement (it even inherits
from Statement in the Parser API spec), but a function
or a try
body is not quite a
statement; the block brackets are required there.
Where this makes a difference for my humble compressor is, UglifyJS defines these simple rules for compressing a block statement:
- Compress every statement in body
- Tighten the body (this involves joining consecutive
var
statements, dropping unreachable code, joining simple statements with a comma, etc.) - If body.length == 0, return an AST_EmptyStatement2.
- If body.length == 1, return body[0] — usually an AST_SimpleStatement.
- Otherwise, return this (that is, the AST_BlockStatement).
Rules 3 and 4 would break the output if a BlockStatement was used as body of a function
, try
,
catch
or finally
nodes. I wasn't happy with the possible workarounds for this issue. I'd have to
check either that the parent node is not a node where the block is required before applying 3 or 4, or handle a special
case in the code generator (if the body of a statement where brackets are required is not instanceof BlockStatement,
then write the brackets). It seemed like a better decision that the body of all these nodes is an array of statements,
and have the code generator always add the brackets; and the compressor needs not to worry about changing a block
statement to a simple/empty statement—it's always valid.
The SM AST got it right for SwitchStatement. It has a "cases" property which isn't a BlockStatement but an array of SwitchCase nodes. Perhaps it was designed this way because SwitchCase is not a Statement...
Less useful granularity
The SpiderMonkey AST has a UnaryExpression node which represents both prefix and suffix expressions, discerning between them with a "prefix" property. Not a big deal, but it's something of an annoyance; in UglifyJS we have AST_Unary which is the base class, and AST_UnaryPrefix and AST_UnaryPostfix inheritting from AST_Unary. So when necessary, we can treat both the same via the base class, but when we need to discern we have the subclasses.
Additionally, in SpiderMonkey there's a UpdateExpression with the same specification as UnaryExpression. Why having a different node type for this? Here's the specification for both:
interface UnaryExpression <: Expression {
type: "UnaryExpression";
operator: UnaryOperator;
prefix: boolean;
argument: Expression;
}
interface UpdateExpression <: Expression {
type: "UpdateExpression";
operator: UpdateOperator;
argument: Expression;
prefix: boolean;
}
So what differs is the operator; however, both in Acorn and Esprima, the operator is simply a string. I can't see any reason for having different node types.
Literals
On the other hand, there's a Literal node that represents: numbers, strings, true
, false
, null
and literal regexps. Maybe that makes sense for SpiderMonkey, but I would call for more node
types here. In UglifyJS the node hierarchy for these literals looks like this:
AST_Constant
└── AST_String
└── AST_Number
└── AST_RegExp
└── AST_Atom // I'm not using this particular class, could drop it
└── AST_Null
└── AST_Undefined
└── AST_NaN
└── AST_Boolean
└── AST_True
└── AST_False
That's more convenient; to define
the compression
rule for booleans, for example (which is simply to rewrite the tree to use !0 for true
and !1 for false
), I just insert the method in AST_Boolean and it returns the AST for !(1 - this.value). To use
something like the SpiderMonkey AST I'd have to switch (typeof node.value)
in a lot of places.
No “Directive” node
In UglifyJS, the parser tracks “directives” and creates instances of AST_Directive nodes3. It's kinda useful to have a dedicated node type. In the SpiderMonkey AST it's just an ordinary ExpressionStatement (that's the equivalent of AST_SimpleStatement in UglifyJS).
The compressor drops simple statements that don't have side effects. This is done in the compression method for AST_SimpleStatement itself. If the same node type would be used for directives, then the compressor would drop all directives and to work around this I would have to move the code that checks for side effects in the upper node, which could keep track of the statement's position (if at beginning of body, and if body is toplevel or function, then it's directive). That seems a lot more work than just having a dedicated node type.
Identifiers
In SpiderMonkey AST, we have the Identifier type, which is a node for a name, regardless if it's for access or definition. In UglifyJS2 there's some elaborate hierarchy for representing names:
AST_Symbol . . . . . . . . . . . // base class
└── AST_This . . . . . . . . // the "this" symbol
└── AST_SymbolDeclaration . . // symbol that introduces a name in scope
└── AST_SymbolVar . . . . // symbol introducing a variable
└── AST_SymbolFunarg . // symbol naming a function argument
└── AST_SymbolConst . . . // symbol defined with "const"
└── AST_SymbolDefun . . . // symbol that defines a function (statement)
└── AST_SymbolLambda . . . // symbol that names a function (expression)
└── AST_SymbolCatch . . . // symbol that names the exception in "catch"
└── AST_Label . . . . . . // symbol that defines a label
└── AST_SymbolRef . . . . . . // reference to a AST_SymbolDeclaration
└── AST_LabelRef . . . . . // reference to a AST_Label
It's pretty trivial to figure out at parse time which constructor to use, and the additional knowledge provided by this hierarchy is useful in the mangler and compressor:
- for once, the output code is only in the base class (no need to duplicate it, of course)
- the mangler can use
node instanceof AST_SymbolDeclaration
to determine if a symbol introduces a name in scope - the compression for a AST_SymbolRef can check if the declaration it refers to is
instanceof AST_SymbolConst
, and if it actually evaluates to a constant expression it replaces the reference with the expression.
But nothing's perfect
I'm not entirely happy with my AST either4, but I kinda like it more than the SpiderMonkey AST. My structures evolved as I needed change and it was actually nice to not be constrained by a rigid definition made for a different programming language. And now the code is written and it works quite well. I see no point to switch.
A day might come when a combination of Acorn/Esprima + Esmangle + Escodegen might eat my lunch, but I think that day won't happen
anytime soon. Esmangle seems to be far behind: it's big, slow, it produces bigger code than UglifyJS and it breaks my
scripts (UPDATE: Yusuke fixed the issues
I've noticed quite quickly :-). I hope people using the SpiderMonkey AST will use UglifyJS to mangle and compress it,
once this API will be ready.
Enjoy the mighty powers! :)
"use strict"
is in effect. I'll implement this sometime.