Colors: Light | Dark
Welcome! (login)

UglifyJS — why not switching to SpiderMonkey AST

Oct
4
2012

UglifyJS — why not switching to SpiderMonkey AST

Yesterday I wanted to publish version 2 on NPM and make it official, but I hit this NPM issue. Seems it's now fixed, but I'm holding it off to finish a new feature that I'm working on (import the SpiderMonkey AST), so you still have to use the Git tree to play with it.

I completed the CLI tool with a bunch of new options and wrote the documentation in the README. It's now possible to keep certain comments in the output. Still no API documentation so far, but it'll come shortly (I'll actually have a dedicated section on this website for UglifyJS).

Next, some points about why I'm not switching to the SpiderMonkey AST internally. Tl;dr: I don't quite like it. Also, I'd kinda have to rewrite the mangler and compressor from scratch. That's Esmangle's job. Any case, UglifyJS will support the SpiderMonkey AST by converting it to our internal AST.

New kid in town: Acorn

Fellow hacker Marijn Haverbeke just published the fastest JS-parser-in-JS, named Acorn. The speed is daunting. In my tests it's 15-35% better than UglifyJS2 when locations and trackComments are on. Congrats to Marijn for this masterpiece!

Support for the SpiderMonkey AST

I thought it would be nice to provide some inter-operability with foreign AST structures, and since the Mozilla Parser API tree format became quite popular with the development of Esprima (and will be even more so when people start noticing how good Acorn is), I started working on an “AST importer”. It should be fully functional in a few days, meaning, UglifyJS will get a new command-line option to use Acorn or Esprima to do the parsing, and there will be an API that allows one to pass a “standard” AST and apply all the mangling/compression options on it.

First I thought that Acorn being so fast, it might worth to use it for parsing and transform the SpiderMonkey AST into UglifyJS2 AST; that doesn't seem to be the case though. Here's timing for jQuery:

parse_acorn: 0.150
parse_uglify: 0.187
transform AST: 0.083

So if I'd do the parsing via Acorn and transform the SpiderMonkey AST into the UglifyJS2 AST, It would be overally slower by about 50ms. However, I expect this feature to still be useful to folks who generate a SpiderMonkey AST somehow (for example CoffeeScriptRedux) and want to use UglifyJS's mighty compression powers.

Why not using the SpiderMonkey AST internally?

Some people suggested this, and I feel like I owe an answer. Yesterday's hack was a small step in this direction (well, not to using the SM AST internally, but at least being able to grok one). It's the first time I study the SpiderMonkey AST format and I already found a few things I don't like about it; but first and most important, in order to use it I'd have to rewrite the mangler and compressor from scratch (and let's say I'd switch to escodegen for code output). I hardly finished the first big rewrite of UglifyJS and I'm not in a mood to start another one.

No inheritance1

This is the biggest issue. UglifyJS heavily relies on a certain hierarchy among AST nodes. For example the AST_Return and AST_Throw nodes inherit from AST_Exit, which in turns inherits from AST_Jump. In certain parts of the code we deal with return and throw in the same way, therefore we only need to check if node instanceof AST_Exit. There are lots of similar situations, for example AST_Assign is an instanceof AST_Binary, and this helps here and there.

More important than checking for node features, in UglifyJS nodes can have methods and they can be overridden by subclasses. Common OOP technique :p. That wouldn't work with flat objects, however. For example my code generator inserts a "print" method in every node's prototype, so to get the source code of any node I'd write node.print(stream). With a flat inheritance tree, that would have to be more like print(node, stream), where the "print" function would be a big switch block checking for the node type first. That's how escodegen looks like — a few gigantic switches. My code generator is something like half the size.

So any case: making my code work with flat objects would practically mean rewriting all of it. However, assuming Acorn or Esprima implement an inheritance structure, then refactoring UglifyJS to use the SpiderMonkey AST internally would become feasible. Hard work, but doable.

Glitches (IMO) in the SpiderMonkey AST format

Except for the inheritance part, the SpiderMonkey AST doesn't differ that much from UglifyJS2's AST structure. In both, a node is an object, it contains source code location info (and perhaps comments) and a few child nodes. However...

Abusing BlockStatement

I made this mistake in UglifyJS2 too, but I refactored it before the mess spread out. In the Mozilla AST, a BlockStatement is used as the block and finalizer of try node, and as the body of catch and function nodes (but inconsistently, not for switch nodes).

I think that's wrong; while there is a visual resemblance between a block statement and a function body, both semantically and syntactically they are different things. A block statement is, well, a statement (it even inherits from Statement in the Parser API spec), but a function or a try body is not quite a statement; the block brackets are required there.

Where this makes a difference for my humble compressor is, UglifyJS defines these simple rules for compressing a block statement:

  1. Compress every statement in body
  2. Tighten the body (this involves joining consecutive var statements, dropping unreachable code, joining simple statements with a comma, etc.)
  3. If body.length == 0, return an AST_EmptyStatement2.
  4. If body.length == 1, return body[0] — usually an AST_SimpleStatement.
  5. Otherwise, return this (that is, the AST_BlockStatement).

Rules 3 and 4 would break the output if a BlockStatement was used as body of a function, try, catch or finally nodes. I wasn't happy with the possible workarounds for this issue. I'd have to check either that the parent node is not a node where the block is required before applying 3 or 4, or handle a special case in the code generator (if the body of a statement where brackets are required is not instanceof BlockStatement, then write the brackets). It seemed like a better decision that the body of all these nodes is an array of statements, and have the code generator always add the brackets; and the compressor needs not to worry about changing a block statement to a simple/empty statement—it's always valid.

The SM AST got it right for SwitchStatement. It has a "cases" property which isn't a BlockStatement but an array of SwitchCase nodes. Perhaps it was designed this way because SwitchCase is not a Statement...

Less useful granularity

The SpiderMonkey AST has a UnaryExpression node which represents both prefix and suffix expressions, discerning between them with a "prefix" property. Not a big deal, but it's something of an annoyance; in UglifyJS we have AST_Unary which is the base class, and AST_UnaryPrefix and AST_UnaryPostfix inheritting from AST_Unary. So when necessary, we can treat both the same via the base class, but when we need to discern we have the subclasses.

Additionally, in SpiderMonkey there's a UpdateExpression with the same specification as UnaryExpression. Why having a different node type for this? Here's the specification for both:

interface UnaryExpression <: Expression {
    type: "UnaryExpression";
    operator: UnaryOperator;
    prefix: boolean;
    argument: Expression;
}
interface UpdateExpression <: Expression {
    type: "UpdateExpression";
    operator: UpdateOperator;
    argument: Expression;
    prefix: boolean;
}

So what differs is the operator; however, both in Acorn and Esprima, the operator is simply a string. I can't see any reason for having different node types.

Literals

On the other hand, there's a Literal node that represents: numbers, strings, true, false, null and literal regexps. Maybe that makes sense for SpiderMonkey, but I would call for more node types here. In UglifyJS the node hierarchy for these literals looks like this:

AST_Constant
   └── AST_String
   └── AST_Number
   └── AST_RegExp
   └── AST_Atom // I'm not using this particular class, could drop it
      └── AST_Null
      └── AST_Undefined
      └── AST_NaN
      └── AST_Boolean
         └── AST_True
         └── AST_False

That's more convenient; to define the compression rule for booleans, for example (which is simply to rewrite the tree to use !0 for true and !1 for false), I just insert the method in AST_Boolean and it returns the AST for !(1 - this.value). To use something like the SpiderMonkey AST I'd have to switch (typeof node.value) in a lot of places.

No “Directive” node

In UglifyJS, the parser tracks “directives” and creates instances of AST_Directive nodes3. It's kinda useful to have a dedicated node type. In the SpiderMonkey AST it's just an ordinary ExpressionStatement (that's the equivalent of AST_SimpleStatement in UglifyJS).

The compressor drops simple statements that don't have side effects. This is done in the compression method for AST_SimpleStatement itself. If the same node type would be used for directives, then the compressor would drop all directives and to work around this I would have to move the code that checks for side effects in the upper node, which could keep track of the statement's position (if at beginning of body, and if body is toplevel or function, then it's directive). That seems a lot more work than just having a dedicated node type.

Identifiers

In SpiderMonkey AST, we have the Identifier type, which is a node for a name, regardless if it's for access or definition. In UglifyJS2 there's some elaborate hierarchy for representing names:

AST_Symbol . . . . . . . . . . . // base class
   └── AST_This  . . . . . . . . // the "this" symbol
   └── AST_SymbolDeclaration . . // symbol that introduces a name in scope
      └── AST_SymbolVar  . . . . // symbol introducing a variable
         └── AST_SymbolFunarg  . // symbol naming a function argument
      └── AST_SymbolConst  . . . // symbol defined with "const"
      └── AST_SymbolDefun  . . . // symbol that defines a function (statement)
      └── AST_SymbolLambda . . . // symbol that names a function (expression)
      └── AST_SymbolCatch  . . . // symbol that names the exception in "catch"
      └── AST_Label  . . . . . . // symbol that defines a label
   └── AST_SymbolRef . . . . . . // reference to a AST_SymbolDeclaration
      └── AST_LabelRef . . . . . // reference to a AST_Label

It's pretty trivial to figure out at parse time which constructor to use, and the additional knowledge provided by this hierarchy is useful in the mangler and compressor:

But nothing's perfect

I'm not entirely happy with my AST either4, but I kinda like it more than the SpiderMonkey AST. My structures evolved as I needed change and it was actually nice to not be constrained by a rigid definition made for a different programming language. And now the code is written and it works quite well. I see no point to switch.

A day might come when a combination of Acorn/Esprima + Esmangle + Escodegen might eat my lunch, but I think that day won't happen anytime soon. Esmangle seems to be far behind: it's big, slow, it produces bigger code than UglifyJS and it breaks my scripts (UPDATE: Yusuke fixed the issues I've noticed quite quickly :-). I hope people using the SpiderMonkey AST will use UglifyJS to mangle and compress it, once this API will be ready.

Enjoy the mighty powers! :)

Footnotes
1. Note that the Parser API spec does specify inheritance for the interfaces that it defines. It's just that both Acorn and Esprima don't implement any inheritance; the AST they return is composed of flat objects.
2. That can be rendered simply as a semicolon, although it might be optimized away in the next pass, depending on the parent node.
3. although UglifyJS should do more than just tracking them, it should enforce strict syntactic rules when "use strict" is in effect. I'll implement this sometime.
4. I'd need multiple inheritance to be totally happy with my AST.
5 comments. This is HOT!

Add your comment

# Yusuke Suzuki
2012-10-07 09:54
Hello Mishoo, I'm owner of Esmangle & Escodegen. > No inheritance Yes, but it keeps AST simple JS object, and it can be dump to JSON text. Inheritance based implementation is very cool, but this is not simple JS object and it is difficult to use it as Intermediate Representation over a lot of JS modules. > a few gigantic switches At first, we implemented visitor based code generator(called striker), but I and Ariya prefered switch based implementation because its flow is understantable for code readers, so we rewrited it. > the operator is simply a string Because Esprima & Acorn is based on SpiderMonkey's Reflect.parse result. > No “Directive” node Yes, we think this should be fixed ;) https://github.com/Constellation/esmangle/issues/22 http://code.google.com/p/esprima/issues/detail?id=330 So we opened issue at bugzilla to fix the SpiderMonkey AST spec. https://bugzilla.mozilla.org/show_bug.cgi?id=791294 And I'm planning to preempt directive flag in Esmangle. > it breaks my scripts I would like to know these patterns. Would you like to inform it to me or file it to Esmangle issue? Directive problem will be fixed as described above. > it produces bigger code than UglifyJS Latest Esmangle has already produced smaller code than UglifyJS1 (in jQuery 1.8.2 compression) :) Regards, Yusuke Suzuki
# Mishoo
2012-10-07 12:37
> I would like to know these patterns. Would you like to inform it to me or file it to Esmangle issue? Well, it's kinda difficult to debug. I didn't investigate the issue, but here goes, I made a screen-cast to show the issue: https://dl.dropbox.com/u/86449772/DL/compressdl.mkv and here is the source code of my library: https://dl.dropbox.com/u/86449772/DL/DynarchLIB.zip (compress file src/js/thelib-full.js > src/js/thelib.js). To play with it you need to put it under a Web server and access index.html from the browser. When compressed with Esmangle there are some weird issues: the source code dialog popups only once when I press "Run code", but not on subsequent presses; the calendar doesn't popdown when a date is clicked and there probably are more. Unfortunately, no errors show up in the console.. :-( I'm using that library in a big project and there when compressed with Esprima I got more obvious failures (though again I didn't investigate what was possibly broken), however I cannot publish that project. > Latest Esmangle has already produced smaller code than UglifyJS1 (in jQuery 1.8.2 compression) :) So does UglifyJS2. ;-) Speaking of breakages, I also tried running the jQuery test suite and several tests failed with compressed with Esmangle. It got full coverage with UglifyJS2.
# Yusuke Suzuki
2012-10-07 20:51
> I made a screen-cast to show the issue Thanks, it helps me a lot! I'll investigate it! Probably I missed some assumptions on optimization. > So does UglifyJS2. ;-) Great! > Speaking of breakages, I also tried running the jQuery test suite and several tests failed with compressed with Esmangle. Thanks for your suggestion! I checked minified script on Modernizr test, but I'll try jQuery test suite.
# Ingvar Stepanyan
2014-04-22 00:38
Is there any opposite of `from_mozilla_ast` that converts UglifyJS AST to Mozilla one so Uglify could be used as intermediate tool for transforming Mozilla AST while using own AST internally?
# Mishoo
2014-04-22 22:41
Nope, not at the moment. It's not hard to write and it was on my TODO list, but since there wasn't any demand I didn't bother…