Colors: Light | Dark
Welcome! (login)

UglifyJS2: compression beats version 1

Sep
17
2012

UglifyJS2: compression beats version 1

Thank you again for your contributions to my pledgie!

Today I pushed out another batch of commits. I've spent quite some time on the compressor and it does very well in my tests. On jQuery it beats version 1 by 227 bytes after gzip (although surprisingly, the size before gzip is bigger by 80 bytes).

Following there's description of some cases where it performs better than v1.

Skip to usage if you're not interested in reading about compression.

Minimize the number of statements

As you know, v1 will join consecutive simple statements with the comma operator, in the hope that a block of statements will become a single statement and we can drop the block brackets. Here's something that v1 can do:

if (foo) {
    a = x();
    b = y();
}

==>

foo && (a = x(), b = y());

However, the presence of a "non-simple" statement will render this optimization useless:

if (foo) {
    a = x();
    b = y();
    for (; a < b; a++) console.log(a);
}

==>

if (foo) {
    a = x(), b = y();
    for (; a < b; a++) console.log(a);
}

The new compressor handles the above code like this:

if (foo) for (a = x(), b = y(); b > a; a++) console.log(a);

managing to save 3 bytes (the block brackets and one semicolon). Similar savings occur if the statement is switch, if, with, return or throw (version one handled the last two too).

Invent a “return” statement

One thing that compresses well is something like:

function f() {
    if (foo) return bar;
    return baz;
}

==>

function f() {
    return foo ? bar : baz;
}

But sometimes the if is not followed by anything:

function f() {
    if (foo) return bar;
}

V1 will leave that as is. V2 prefers to think there's a return undefined there and will transform it into the following:

function f() {
    return foo ? bar : void 0;
}

For this simple case it's a little more bytes in fact, but I found that in general it helps because it allows to reduce the number of statements. Here's a more complicated function and how it compresses in both versions (use the small button on the right which shows up when you hover the code to view it easier):

function setOpacity(el, o) {
    if (o != null) {
        if (o == "" && o != 0) {
            is_ie
                ? el.style.filter = ""
                : el.style.opacity = "";
        } else {
            is_ie
                ? el.style.filter = "alpha(opacity=" + Math.round(o * 100) + ")"
                : el.style.opacity = o;
        }
        return o;
    } else {
        if (!is_ie)
            return parseFloat(el.style.opacity);
        else
            if (/alpha\(opacity=([0-9.])+\)/.test(el.style.opacity))
                return parseFloat(RegExp.$1);
    }
}

==(v1)==>

function setOpacity(el, o) {
    if (o != null) return o == "" && o != 0 ? is_ie ? el.style.filter = "" : el.style.opacity = "" : is_ie ? el.style.filter = "alpha(opacity=" + Math.round(o * 100) + ")" : el.style.opacity = o, o;
    if (!is_ie) return parseFloat(el.style.opacity);
    if (/alpha\(opacity=([0-9.])+\)/.test(el.style.opacity)) return parseFloat(RegExp.$1);
}

==(v2)==>

function setOpacity(el, o) {
    return o != null ? (o == "" && o != 0 ? is_ie ? el.style.filter = "" : el.style.opacity = "" : is_ie ? el.style.filter = "alpha(opacity=" + Math.round(o * 100) + ")" : el.style.opacity = o, o) : is_ie ? /alpha\(opacity=([0-9.])+\)/.test(el.style.opacity) ? parseFloat(RegExp.$1) : void 0 : parseFloat(el.style.opacity);
}

Thanks to the imaginary return undefined, version 2 is able to get rid of three if and two return keywords, and makes it a single statement; all at the price of a void 0.

IF ... return/continue

Example for if...return:

function g() {
    if (foo) return;
    if (bar) return;
    if (baz) return;
    if (baa) return;
    a();
    b();
}

==>

function g() {
    foo || bar || baz || baa || (a(), b());
}

It can only do this, however, if the if/return appears directly under the function body (that is, not nested in another if, for etc.). A similar optimization occurs for if/continue, again, only if the if is directly in the block that continue refers to:

for (var i = 0; i < 5; ++i) {
    if (i < 3) continue;
    console.log(i);
}

==>

for (var i = 0; 5 > i; ++i) 3 > i || console.log(i);

Undefined

We now optimize typeof x == "undefined" to x === undefined. The undefined value itself will be rendered as void 0, unless there's a variable named "undefined" in scope—if so, AST_Undefined nodes will be rendered "undefined" and the mangler in turn will reduce the name to a single character. This saves some bytes on jQuery after gzip (also coupled with the “return undefined” statement that I mentioned before).

Smarter sequences

Two smallish optimizations for the “sequences” (comma operator):

x, x ==> x
x = expression(), x ==> x = expression()

Seems like nobody will write code like this, but our compressor does, for example:

stuff += expression();
if (stuff) { foo() }

==>

if (stuff += expression(), stuff) { foo() }

==(after AST_Seq optimization)==>

if (stuff += expression()) { foo() }

==(and after AST_If optimization)==>

(stuff += expression()) && foo()

I have bigger plans for this feature. Basically, if between the moment a variable is initialized and the moment it's referenced we can detect that no code may have side effects on that variable, and if that variable is only referenced once, then we could safely put the initialization in place of the reference and discard the initialization statement. It's tricky to do and I think it will slow down the compressor considerably, but it might also provide considerable savings, so I plan to work on it.

Fine-tuning for gzip

When name mangling is requested, v2 takes the trouble of counting character frequencies for all unmangleable keywords / strings / properties (idea discussed in v1) and starts generating names with the most frequently used characters1. This saves ~130 bytes on jQuery after gzip, although the unzipped file size is the same.

I also tried to sort names by number of references, and generate first those names that are most frequently used. This would mean that the most frequently used names would be shorter and made of characters with high frequency numbers. It saves almost 1K before gzip, but adds 450 bytes after gzip!! So I thought maybe gzip prefers more frequently used stuff to have longer names (it does very well compressing repetitive long sequences after all) so I tried the reverse sort, but that adds 1K after gzip. :-\

Optimizing for gzip is tricky business.

Another small optimization is to always prefer to use ">" and ">=" (no less-than operator). I stole this idea from Google Closure. It saves 23 bytes on jQuery.

Usage

To use the new version with NodeJS you need source-map and optimist modules, and then just clone the repository and symlink the command-line tool somewhere in your $PATH, for example:

npm install source-map
npm install optimist
cd ~
git clone git://github.com/mishoo/UglifyJS2.git
mkdir -p ~/bin
cd ~/bin
ln -s ~/UglifyJS2/bin/uglifyjs2

The command-line tool outputs the following on --help:

uglifyjs2 [options] input1.js [input2.js ...]
Maximum compression settings are on by default.
Use a single dash to read input from the standard input.

Options:
  --source-map       Specify an output file where to generate source map
  --source-map-root  The root of the original source to be included in the source map
  -p, --prefix       Skip prefix for original filenames that appear in source maps
  -o, --output       Output file (default STDOUT)
  --stats            Display operations run time on STDERR
  -v, --verbose      Verbose
  -b, --beautify     Beautify output
  -m, --no-mangle    Don't mangle names
  -c, --options      Pass compressor options

I briefly described the command line options in the previous post; what's new since then is the -b (if you want to get beautified output), -m if you want to skip the mangling step and the -c to specify compressor options.

The following compressor options are available (I'm sorry, this list is a bit messy and it's possible at places that it does compression even when options are off):

sequences     : join simple statements with the comma operator
properties    : a["foo"] ==> a.foo
dead_code     : drop unreachable code
drop_debugger : drop the debugger statement
unsafe        : apply the "unsafe" transformations
conditionals  : optimize if-s and conditionals
comparations  : optimize comparations
evaluate      : try to evaluate expressions
booleans      : optimize booleans
loops         : optimize loops
unused_func   : drop unused functions (not from toplevel scope)
hoist_funs    : hoist function declarations
hoist_vars    : hoist var declarations
if_return     : optimize if/return and if/continue
join_vars     : join consecutive var statements
cascade       : for now the sequence optimization, x = foo, x ==> x = foo

The argument of -c should be a comma-separated list of prop=value, where prop is an option from the list above and value is a JavaScript expression (note, it's evaluated! don't use the CLI with uncontrolled data). For now all options are booleans and they're all true by default.

There is no universally optimal way to compress all scripts, so you might want to disable some options and test if you get better file size after gzip. For example on jQuery hoist_vars and join_vars don't help too much; they do make the size smaller before gzip, but it's slightly bigger after gzip. I found the gzip-optimal way to compress jquery is like this:

uglifyjs2 -c hoist_vars=false,join_vars=false jquery-1.8.1.js > jq.js

If you want you can also generate a source map, for example:

uglifyjs2 jquery-1.8.1.js \
  -c hoist_vars=false,join_vars=false \
  --source-map jq.js.map \
  --source-map-root http://code.jquery.com/ \
  -o jquery-1.8.1.min.js

Thanks to “node-optimist” you can now pass arguments in any order (i.e. you can begin with the list of files to compress). UglifyJS2 supports multiple files in the command line, which is important for generating an useful source map.

Issues

(besides many missing features that v1 has)

Currently we lack support for directives (they're simply read as strings). I'll fix that shortly — done. Otherwise, please consider this code “bleeding edge”—I can't be sure it's as stable as version 1, until it will be in use by so many people.

The tests are still lagging behind :-(; the way I test the compression is, to me, more reassuring than a limited test suite; I'm compressing some 650K which is used in a big application; then load that application and start clicking around, and if it doesn't crash in a few minutes I declare myself happy.

Please use the Github issue tracker if you find any bugs!

Footnotes
1. On DynarchLIB, the most frequently used characters are, in this order: t, e, n, i, s. ;-)
5 comments. This is HOT!

Add your comment

# Dannii
2012-10-02 18:04
Great to see that it now counts character frequencies! :) Does that add much extra time? It's not too surprising that assigning the most frequent characters to the most referenced variables isn't as helpful. The most referenced vars are more likely to be in scope more of the time meaning that the character can't be reused. Consider that if your most referenced var is referenced 50 times then the char can be used 50 times, compared to the first parameter of 50 functions, each of which is only referenced twice (once in the parameters list and once in the function itself), which could be assigned the same character, resulting in it being used 100 times. If you already have reference counting code, could you try this? Reverse the list, and assign the characters to the least referenced variables first. Start with 't', and only move on to 'e' when you get to a variable that is in the same scope as another 't', and so on. The optimal compressor would solve some crazy graph colouring puzzle (http://en.wikipedia.org/wiki/Graph_coloring) to assign characters to variables that aren't in the same scope, but the programming and processing time would be quite unoptimal. But perhaps the reference counting code you already have might give an improvement if the list is reversed.
# Dannii
2012-10-03 16:04
I should say that if you can push your attempt at sorting vars by reference count then I'll try these ideas myself. I'd like to try sorting by count/scope_height, where scope_height would be the greatest number of nested scopes within a function, so that a func with no children has a height of 1, its parent has a height of 2, and its grandparent a height of 3. I think this might provide a reasonable weighting to reference counts vs scope while being relatively processor light.
# Mishoo
2012-10-03 16:13
Indeed, good point that the most frequently used vars are in higher scope! So the code that sorts by frequency is there, you just need to pass -m sort. I'll try to take nesting into account too. In fact, the solution might be simpler than expected — I think I'd simply have to mangle “depth-first” (descend to most nested scopes first and start mangling from there). Will try that.
# Dannii
2012-10-03 16:49
Hey Mishoo, I've found the relevant code... because the link in the paragraph starting with "I also tried to sort names" is no longer relevant I thought you hadn't pushed it. I'll have a try, and if I have any success I'll send you a pull request. I'm impressed with your code btw! Do you happen to have test for gzip size comparisons already?
# Mishoo
2012-10-03 17:34
Nah, just using this quick'n'dirty script that compares jQuery compressed with first and current UglifyJS versions: echo v1 time uglifyjs -nc ~/tmp/jquery-1.8.1.js > /tmp/jq1 echo v2 time uglifyjs2 ~/tmp/jquery-1.8.1.js -c hoist_vars=false,join_vars=false -m > /tmp/jq2 gzip -9 -c /tmp/jq1 > /tmp/jq1.gz gzip -9 -c /tmp/jq2 > /tmp/jq2.gz ls -la /tmp/jq1* /tmp/jq2*