JSZap: Compressing JavaScript Code

Martin Burtscher, UT Austin
Ben Livshits & Ben Zorn, Microsoft Research
Gaurav Sinha, IIT Kanpur

A Web 2.0 Application Dissected

• Talks to 14 backend services (traffic, images, directions, ads, …)
• 1+ MB code
• 70,000+ lines of JavaScript code downloaded
• 2,855 functions

Lots of JavaScript being Transmitted

[Bar chart: fraction of download that is JavaScript (0% to 100%) for www.live.com, spreadsheets.google, maps.live, chi.lexigame, hotmail, gmail, dropthings, maps.google, pageflakes, and bunny hunt]

Up to 85% of a Web 2.0 app is JavaScript code!

AJAX: Tension Headaches

• Move code to client for responsiveness
• Execution can’t start without the code

JavaScript on the Wire

[Diagram: JavaScript → crunch → gzip on the sending side; gzip -d → parser → AST on the receiving side; JSZap is labeled on the diagram]

             JSZap Approach

• Represent JavaScript as AST instead of source

• Serialize the compressed AST

• Decompress directly into AST on client

• Use gzip as 2nd-level (de-)compressor
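
The round trip listed above can be sketched in a few lines. This is a minimal illustration and not JSZap's actual wire format: the node shape ({ prod, children }), the one-byte-per-field layout, and the helper names serialize and deserialize are all assumptions made for the example. Only zlib.gzipSync and zlib.gunzipSync are real Node.js calls.

```javascript
const zlib = require("zlib");

// Assumed node shape: { prod: <production id 0..255>, children: [...] }.
// Pre-order serialization: production id, then child count, then children.
function serialize(node, out = []) {
  out.push(node.prod, node.children.length);
  for (const child of node.children) serialize(child, out);
  return out;
}

// Rebuilds the tree directly from the byte sequence, no source text involved.
function deserialize(bytes, pos = { i: 0 }) {
  const prod = bytes[pos.i++];
  const childCount = bytes[pos.i++];
  const children = [];
  for (let k = 0; k < childCount; k++) children.push(deserialize(bytes, pos));
  return { prod, children };
}

// Hypothetical AST with made-up production ids.
const ast = {
  prod: 1,
  children: [
    { prod: 3, children: [{ prod: 5, children: [] }, { prod: 5, children: [] }] },
    { prod: 5, children: [] },
  ],
};

// gzip acts as the second-level compressor on the serialized tree.
const wire = zlib.gzipSync(Buffer.from(serialize(ast)));
const restored = deserialize(zlib.gunzipSync(wire));
console.log(JSON.stringify(restored) === JSON.stringify(ast));
```

On the receiving side the tree comes back without ever running a source-level parser, which is the point of the third bullet.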



Benefits of AST-based Compression
   Reduced Latency

   • Compression: less to transmit
   • ASTs are blasted directly into the browser

   Reduced Network Bandwidth

   • Reduces mobile charges
   • Reduces operator network costs: better for servers

   Correctness, Security, and other Benefits

   • Ensures well-formedness of code
   • Makes it easy to check language subsets (ADsafe)
   • Caching incremental updates
   • Unblocking HTML parser

JSZap Compression

[Diagram: JavaScript → JSZap → gzip]

JSZap Compression

[Diagram: JavaScript → three streams (1 productions, 2 identifiers, 3 literals) → gzip]

GZIP is a formidable opponent

JSZap vs. GZIP

[Stacked bar chart, size in KB (axis 0 to 40); legend: Literals, Identifiers, Productions]

Segment sizes in KB, in the order they appear on the slide:

   gzip    JSZap
   11.5      8.4
   19.0     18.4
    5.4      5.4

Talk Outline

1. productions
2. identifiers
3. literals

Evaluation on real code

Background: ASTs

Expression: a*b+c

Grammar:
   1) E → E + T
   2) E → T
   3) T → T * F
   4) T → F
   5) F → id

Tree (each node annotated with a production number):

          + (1)
         /     \
      * (3)    c (5)
      /   \
   a (5)  b (5)

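
As a small illustration of how such a tree turns into a sequence of production numbers, the sketch below builds the tree for a*b+c by hand and walks it. The pre-order traversal and the node helper are assumptions chosen for the example; the slide does not state which traversal order JSZap uses.

```javascript
// Hand-built tree for a*b+c, using the production numbers shown on the slide.
// `node` is a hypothetical helper, not part of any library.
const node = (production, label, children = []) => ({ production, label, children });

const tree = node(1, "+", [
  node(3, "*", [node(5, "a"), node(5, "b")]),
  node(5, "c"),
]);

// Pre-order walk (an assumption): emit a node's production, then its children.
function productionStream(n, out = []) {
  out.push(n.production);
  n.children.forEach((child) => productionStream(child, out));
  return out;
}

console.log(productionStream(tree).join(" ")); // 1 3 5 5 5
```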
A Simple JavaScript Example
var y = 2;
function foo () {
       var x = "jscrunch";
       var z = 3;
       z = y + y;
}
x = "jszap";


Production Stream
   1  3  4  ...  1  3  4  ...

Identifier Stream
   y  foo  x  z  z  y  y  x

Literal Stream
   "jscrunch"  2  3  "jszap"

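
The identifier and literal streams for this example can be reproduced with a short sketch. It uses a regex tokenizer as a stand-in for a real JavaScript parser, which is an assumption made to keep the example self-contained; splitStreams and the keyword list are hypothetical. It emits literals in source order, which differs from the order printed on the slide.

```javascript
// Keywords are filtered out so that only names end up in the identifier stream.
const KEYWORDS = new Set(["var", "function", "return", "if", "else"]);

// Hypothetical helper: a regex tokenizer standing in for a real parser.
function splitStreams(source) {
  // Order matters: string literals, then numbers, then names.
  const tokenPattern = /"(?:[^"\\]|\\.)*"|\d+(?:\.\d+)?|[A-Za-z_$][\w$]*/g;
  const identifiers = [];
  const literals = [];
  for (const token of source.match(tokenPattern) || []) {
    if (token.startsWith('"') || /^\d/.test(token)) {
      literals.push(token);
    } else if (!KEYWORDS.has(token)) {
      identifiers.push(token);
    }
  }
  return { identifiers, literals };
}

const example = [
  "var y = 2;",
  "function foo () {",
  '  var x = "jscrunch";',
  "  var z = 3;",
  "  z = y + y;",
  "}",
  'x = "jszap";',
].join("\n");

const streams = splitStreams(example);
console.log(streams.identifiers.join(" ")); // y foo x z z y y x
console.log(streams.literals.join(" "));    // 2 "jscrunch" 3 "jszap"
```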
Benchmarking JSZap

Benchmark name   Source lines   Source bytes
gmonkey                   922         17,382
getDOMHash              1,136         25,467
bing1                   3,758         77,891
bingmap1                3,473         80,066
livemsg1                5,307         93,982
bingmap2                9,726        113,393
facebook1               5,886        141,469
livemsg2                7,139        156,282
officelive1            22,016        668,051

• JavaScript files up to 22K LOC
• Variety of app types
• Both hand-generated and machine-generated
• gzipped everything

Components of JavaScript Source

[Stacked bar chart: share of productions, identifiers, and literals (0% to 100%) for gmonkey, getDOMHash, bing1, bingmap1, livemsg1, bingmap2, facebook1, livemsg2, and officelive1]

• None of the categories can be ignored
• Identifiers become more prominent with code growth

 Compressing the Production Stream
• Frequency-based production renaming

• Differential encoding: 26 and 57 => 2 and 3

• Chain rule: eliminate predictable productions

• Tree-based prediction-by-partial-match
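
The first bullet, frequency-based renaming, can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: renameByFrequency is a hypothetical helper, and the input stream is made up. The idea shown is only that the most frequent production ids get the smallest new ids, so that a later entropy coder or gzip sees many small, repeated values.

```javascript
// Hypothetical helper: renumber production ids so frequent ones get small ids.
function renameByFrequency(stream) {
  const counts = new Map();
  for (const p of stream) counts.set(p, (counts.get(p) || 0) + 1);

  // Most frequent first; ties broken by the original id to stay deterministic.
  const table = [...counts.keys()].sort(
    (a, b) => counts.get(b) - counts.get(a) || a - b
  );
  const newId = new Map(table.map((p, i) => [p, i]));

  // `table` must travel with the data so the receiver can undo the renaming.
  return { renamed: stream.map((p) => newId.get(p)), table };
}

// Made-up production stream for illustration.
const { renamed, table } = renameByFrequency([57, 26, 57, 57, 4, 26]);
console.log(renamed.join(" ")); // 0 1 0 0 2 1
console.log(table.join(" "));   // 57 26 4
```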

PPMC

• Consider compressing
   – if (P) then X else X
• Should be very compressible
   – if (P) then ...abc... else ...abc...

[Tree diagram: P with two identical subtrees X and X]

• Tree context used to build a predictor
• Provides the next likely child node given context C, position p, and child
• Arithmetic coding: more likely = shorter IDs
• See paper for details

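
To make the idea of a tree-context predictor concrete, here is a deliberately simplified sketch. It keeps counts keyed by (parent production, child position) and predicts the most frequent child seen in that context. It is not the PPMC variant from the paper: there is no escape mechanism, no higher-order context, and no arithmetic coder, and the class name, node shape, and production names are all assumptions.

```javascript
// Simplified order-1 predictor: context = (parent production, child position).
class TreeContextPredictor {
  constructor() {
    this.counts = new Map(); // context key -> Map(child production -> count)
  }

  observe(parent, position, child) {
    const key = `${parent}:${position}`;
    if (!this.counts.has(key)) this.counts.set(key, new Map());
    const seen = this.counts.get(key);
    seen.set(child, (seen.get(child) || 0) + 1);
  }

  // Returns the most frequent child for this context, or null if unseen.
  predict(parent, position) {
    const seen = this.counts.get(`${parent}:${position}`);
    if (!seen) return null;
    let best = null;
    let bestCount = 0;
    for (const [child, count] of seen) {
      if (count > bestCount) {
        best = child;
        bestCount = count;
      }
    }
    return best;
  }

  // Records every parent/child pair in a subtree.
  train(node) {
    node.children.forEach((child, i) => {
      this.observe(node.prod, i, child.prod);
      this.train(child);
    });
  }
}

// "if (P) then X else X": both branches share the same shape X.
const leaf = (prod) => ({ prod, children: [] });
const makeX = () => ({ prod: "call", children: [leaf("id"), leaf("num")] });

const predictor = new TreeContextPredictor();
predictor.train(makeX()); // the "then" branch is seen first

const elseBranch = makeX();
const hits = elseBranch.children.filter(
  (child, i) => predictor.predict(elseBranch.prod, i) === child.prod
).length;
console.log(hits); // 2
```

Once the first branch has been seen, every child of the second branch is predicted correctly, which is why a coder that assigns short codes to predicted symbols does well on such repeated structure.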
Production Compression (gzip = 1)

[Bar chart: Production Compression with PPMC, relative to gzip (axis 50% to 100%), for gmonkey, getDOMHash, bing1, bingmap1, livemsg1, bingmap2, facebook1, livemsg2, and officelive1; labeled value: 0.6772]

Compressing the Identifier Stream
• Symbol tables instead of identifier stream:
  – Compress redundancy: offset into table
  – Global or local symbol tables
  – Use variable-length encoding


• Other techniques:
  – Sort symbols by frequency
  – Rename local variables
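
A minimal sketch of the symbol-table idea, using the identifier stream from the earlier example slide. buildSymbolTable is a hypothetical helper, and a single global table is assumed; the slide also mentions local tables, which are not modeled here.

```javascript
// Hypothetical helper: replace each identifier by its offset into a table
// that is sorted by descending frequency.
function buildSymbolTable(identifiers) {
  const counts = new Map();
  for (const name of identifiers) counts.set(name, (counts.get(name) || 0) + 1);

  // Array.prototype.sort is stable, so ties keep first-appearance order.
  const table = [...counts.keys()].sort((a, b) => counts.get(b) - counts.get(a));
  const indexOf = new Map(table.map((name, i) => [name, i]));

  return { table, offsets: identifiers.map((name) => indexOf.get(name)) };
}

const { table, offsets } = buildSymbolTable("y foo x z z y y x".split(" "));
console.log(table.join(" "));   // y x z foo
console.log(offsets.join(" ")); // 0 3 1 2 2 0 0 1

// Decoding is a plain table lookup.
const decoded = offsets.map((i) => table[i]);
console.log(decoded.join(" ")); // y foo x z z y y x
```

Each name is stored once in the table, and frequent names receive the smallest offsets, which is what makes a variable-length encoding of the offsets worthwhile.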

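Variable-length Encoding for Identifiers

[Decision tree: the root asks "is global?"; its two branches ask "is renamed local" and "fits in 1 byte?"; the four leaves carry the 2-bit prefixes 00…, 01…, 10…, and 11…]
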
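
A sketch of a prefix-tagged, variable-length index encoding in the spirit of this slide. The assignment of the 2-bit prefixes is an assumption and is not taken from the slide, which does not show which prefix belongs to which case; the "renamed local" and "parent" cases are left out. encodeIdentifier and decodeIdentifier are hypothetical names.

```javascript
// Assumed tag layout (NOT from the slide), stored in the top two bits:
//   00 = global, 1 byte    01 = global, 2 bytes
//   10 = local,  1 byte    11 = local,  2 bytes
// The remaining 6 bits (1 byte) or 14 bits (2 bytes) hold the table index.
function encodeIdentifier(index, isGlobal) {
  if (index < 0 || index >= 1 << 14) throw new RangeError("index out of range");
  const fitsInOneByte = index < 1 << 6;
  const tag = (isGlobal ? 0 : 2) | (fitsInOneByte ? 0 : 1);
  return fitsInOneByte
    ? [(tag << 6) | index]
    : [(tag << 6) | (index >> 8), index & 0xff];
}

function decodeIdentifier(bytes, offset = 0) {
  const tag = bytes[offset] >> 6;
  const isGlobal = (tag & 2) === 0;
  const twoBytes = (tag & 1) === 1;
  const high = bytes[offset] & 0x3f;
  const index = twoBytes ? (high << 8) | bytes[offset + 1] : high;
  return { index, isGlobal, length: twoBytes ? 2 : 1 };
}

console.log(encodeIdentifier(5, true));    // one byte:  5
console.log(encodeIdentifier(300, false)); // two bytes: 193, 44
console.log(decodeIdentifier(encodeIdentifier(300, false)).index); // 300
```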
Variable-Length Identifier Encoding

[Stacked bar chart: share of each identifier encoding (0% to 100%) for gmonkey, getDOMHash, bing1, bingmap1, livemsg1, bingmap2, facebook1, livemsg2, and officelive1; legend: parent, local 2byte, local 1byte, local builtin, global 2byte, global 1byte]

Identifiers (NoST = 1)

[Bar chart: Symbol Tables: Effectiveness, relative to NoST (axis 80% to 100%), for gmonkey, getDOMHash, bing1, bingmap1, livemsg1, bingmap2, facebook1, livemsg2, and officelive1; series: Global ST, VarEnc; labeled values: 0.943 and 89%]

Compressing Literals
• Symbol tables
• Grouping literals by type
• Prefixes and postfixes
• These techniques result in 5-10% savings compared to gzip
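
Grouping literals by type can be sketched as below, using the literal stream from the earlier example slide. groupLiteralsByType is a hypothetical helper, and using JavaScript's typeof as the grouping key is an assumption; the slides do not say which type categories JSZap distinguishes.

```javascript
// Hypothetical helper: split a literal stream into one list per type,
// so that similar values sit next to each other before gzip runs.
function groupLiteralsByType(literals) {
  const groups = new Map();
  for (const literal of literals) {
    const type = typeof literal;
    if (!groups.has(type)) groups.set(type, []);
    groups.get(type).push(literal);
  }
  return groups;
}

const groups = groupLiteralsByType(["jscrunch", 2, 3, "jszap"]);
console.log(groups.get("string").join(" ")); // jscrunch jszap
console.log(groups.get("number").join(" ")); // 2 3
```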




Average JSZap Compression: 10%

[Bar chart: JSZap Compression (gzip = 1), axis 80% to 100%, for gmonkey, getDOMHash, bing1, bingmap1, livemsg1, bingmap2, facebook1, livemsg2, and officelive1; labeled values: 0.8792 and 13% savings]

[Pie chart: Productions, 26%; Identifiers, 57%; Literals, 17%]

       Summary and Conclusions
• JSZap: AST-based compression for JavaScript

• Propose a range of techniques for compressing
   – Productions
   – Identifiers
   – Literals

• Preliminary results are encouraging: 10% savings over gzip

• Future focus
   – Latency measurements
   – Browser integration

[Diagram: AST representation at the center, connected to the following]
• Compression with JSZap
• Security (ADsafe)
• Well-formedness
• Unblocking HTML parser
• Caching and incremental updates
• ?

Questions?

				