2019-05-16: I used to avoid releasing the source code to see how far I could go without outside influences, but after enough persuasion from Maxime Euzière I've made the raw source code available on GitLab. This page will continue to host explanations that are not obvious from the code.
A simple but efficient 5-bit packer with delta coding of the shared prefix, plus an additional 3 bits for rare specials.
s=`ENCODED`;w="";b=1;v=5;for(i=0;s[i];++i)for(c=s.charCodeAt(i),b=b<<(c>>7?10:7)|c&1023;b>>v;v-5?w+=" '-.2é♀♂"[v=5,d&7]:d?d<27?w+=String.fromCharCode(96+d):w=w.substr(console.log(w),d-27):v=3)d=b&31,b>>=v
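The recurring w.substr(console.log(w),n) idiom implements this delta coding: console.log returns undefined, which substr coerces to a start index of 0, so the word is printed and only its first n characters survive as the shared prefix of the next word. A hypothetical illustration, not from any submission:

w="compress";
w=w.substr(console.log(w),8);  // prints "compress" and keeps all 8 characters
w+="ion";                      // the next word, "compression", costs only 3 new characters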
Adapted a Huffman decoder from practically.
k=[C="charCodeAt"];b=u=2e4;for(c of`CODES`)b+=(k[b]=c[C]())?1:b;w="";for(c of`ENCODED`)for(c=c[C]()^128;c>1;c>>=1)for(v=k[u=u*2+c%2];v|=0;u=2e4)d=v%43,v/=43,d>8?w+=" '-.2é♀♂"[d-9]||String.fromCharCode(80+d):w=w.substr(console.log(w),d-1)
Multiple incremental improvements over ivysaur.
k=[b=u=2e4];for(c of`CODES`)b+=(k[b]=c[C="charCodeAt"](j=w=""))?1:b;for(c of`ENCODED`)for(c=c[C](),c=c>>7?c<<5|[13,4,28][c&3]:c^128;c>1;c>>=1)for(v=k[u=u*2+c%2];v|=0;u=2e4)d=v%36,v/=36,d>8?w+=d-9?String.fromCharCode(87+d):`'éé- .. ♀♂-2`[j++]:w=w.slice(console.log(w),d-1)
A direct adaptation of bulbasaur to emit 4 bit nibbles instead.
(I don't know why I ended up with "henry".)
s=`ENCODED`;w="";b=1;for(i=0;s[i];++i)for(c=s.charCodeAt(i),b=b<<(c>>7?10:7)|c&1023;b>>4;b>>=4)w+=(b&15).toString(16),w.length%64||console.log(w.substr(-64))
Switched to BigInt, which I suspected had been available for a long time. The decoder also no longer tries to extract as many bits as possible from two-byte UTF-8 sequences; it now always reads 7 bits per code point (this approach is described below). Since this allows very efficient packing of nibbles, it became much easier to replace a sequence of nibbles with a newline instead of counting nibbles to insert one! This clever 7-bit packer greatly improved both the code size and the data size.
b=0n;for(c of`ENCODED`)b=b<<7n^BigInt(c.charCodeAt());console.log(b.toString(16).replace(/6a/g,`NEWLINE`))
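The matching encoder is essentially the reverse. A hypothetical sketch (the function name is mine, and escaping \r, $, \ and ` in the emitted source is glossed over):

function encode7(hex){                       // hex: the data with newlines already replaced by the marker pair
  let b=BigInt("0x"+hex),out=[];
  for(;b;b>>=7n)out.unshift(Number(b&127n)); // split into 7-bit groups, most significant first
  return out.map(g=>String.fromCharCode(g)).join(""); // one code point per group
}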
No code changes, but takes advantage of the rule that reordering is permitted. The original code was optimal in code point count but 10 bytes larger in byte count; this submission reduced the gap to 2 bytes. I believe this can be optimized further, but it is hard to search efficiently.
Okay, this one is annoying. It should be noted that this file contains:
This solution does try to exploit them but got nowhere. A ton of bit packing and implicit assumptions (hard to debug) are made: varying bit length depending on the maximum element (there are 3 modes, the last being a "bignum" mode), delta coding for the first two modes, metadata for periodicity and sign coding... Hence the name.
s=`ENCODED`;i=0;b=1;for(r=v=>b>>v?(q=b&(1<<v)-1,b>>=v,q):(g=s.charCodeAt(i++),b=b<<(g>>7?10:7)|g&1023,r(v));l=r(7);console.log(a+''))for(m=r(6),v=2,a=[],e=0;v||e<l;v&&-~c>>v?v>9||r(1)?v=0:v+=v%5:c>999?a[e-1]+=(c+'').substr(1):(e++&&m&2?c+=a[e-2]:0,a.push(v&&m&1&&r(1)?-c:c)))c=v?r(v):a[e-(m>>2)]
Ditched the manual coding and switched to a rANS decoder with an order-2 context model (plus delta coding to squeeze out a little more). This alone saved 4KB, and the next 5KB was saved via manual definition matching: 91 out of 176 sequences were encoded in about 1.7KB of code before compression. They are of course compacted, but not necessarily as short as possible; rather, they are kept consistent in style because repetition helps the compressor. For the obvious reason, several shared primitives like σn are introduced.
n=r=-1;t=INITIALSTATE;M=2<<16;A=x=>Array(M<<7).fill(x);m=A(1);p=A(M/4);s=[B=BigInt];L=x=>s=[n=~console.log(x+'')];P=x=>x&&P(x>>1)+x%2;R=(f,x=s[n-1]+1||0)=>f(x)?x:R(f,x+1);S=(x,n=1,y=x)=>y&&!(x%y)*y**n+S(x,n,y-1);I=x=>S(x)==x+1;T=n=>n-(N=1.3*n**.51|0)*-~N/2;for(k=i=c=0;i<8121;v-=256,c=c%256<<8|v,k?v?o+=String.fromCharCode(v):L(s,eval(o.repeat(k),k=0)):v>>7?~n?s[n]=s[n]<<7n|B(v&127):(k=v&127,o=';s[++n]='):v-97?s[++n]=B(v-=(v>97)<<7):L(s.map(v=>w+=v,w=0n)))for(v=1;v<256;p[d]+=(b*M/2-p[d]<<14)/(m[d]+=2*(m[d]<126))>>13,v=v*2+b)for(q=p[d=c<<8|v]*2+1,b=t%M<q,t=(b?q:M-q)*(t>>17)+t%M-!b*q;t<M<<4;r=r?y&127:-1)y=~r?r:`ENCODED`[C='charCodeAt'](i++),t=t<<7|([,13,36,92,96][r=y>>7]||y)
A 7-bit arithmetic decoder (rANS) and five different context models (order 0 through 2 contexts, plus sparse contexts (0,2) and (0,3)) combined with logistic context mixing. Recursive bigram and delta coding were also used to pick the low-hanging fruit. Total memory consumption: ~7MB (assuming an optimal memory layout; there are no fancy objects), total wall time < 10s.
This was a bit hard and took much time, but I did it. I also tried to exploit the columnar structure, but given enough sparse contexts it was deemed redundant. The resulting decoder is a bit beefy, weighing in at ~700 bytes, but the CM already gave me plenty of margin. There is also some crazy code that reduces operations and byte counts simultaneously.
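A readable sketch of the logistic mixing step, distilled from the code below (the learning rate and exact scaling are illustrative assumptions):

function mix(probs,weights,lr){                 // probs: each model's probability of a 1 bit, in (0,1)
  let x=probs.map(p=>Math.log(p/(1-p)));        // stretch into the logit domain
  let p=1/(1+Math.exp(-x.reduce((s,v,i)=>s+weights[i]*v,0))); // squash the weighted sum back to (0,1)
  return{p,update:bit=>x.forEach((v,i)=>weights[i]+=lr*v*(bit-p))}; // reward the models that leaned the right way
}

The golfed version below folds the learning rate into the stretched values and keeps everything in fixed point.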
s=`ENCODED`;r=-1;t=INITIALSTATE;M=2<<16;D=[];for(i=1;i<128;i+=2)D.push(M/i|0);m=[...'12333'].map(i=>[Array(1<<6*i).fill(0),Array(1<<6*i).fill(M/4),[0]]);z=[];a="";e=d=>(x=`BIGRAMS`.charCodeAt(d-50))?e(x%64,e(x>>6)):d<24?a=a.substr(console.log(a),a.length-d):a+=String.fromCharCode(73+d);for(i=k=0;k<DECODEDLEN;e(z[k++]=v-64))for(v=1;v<64;m.map(([c,p,w],i)=>(y=p[j=u[i]<<6|v]+=((b<<16)-p[j])*D[c[j]+=c[j]<63]>>16,w[0]+=x[i]*(b-q/M))),v=v*2+b)for(u=[,u=z[k-1],z[k-2]<<6|u,z[k-3]<<6|u,z[k-4]<<6|u],q=0,x=m.map(([,p,[w]],i)=>(x=p[u[i]<<6|v]*2+1,x=Math.log(x/(M-x)),q-=w*x,3/2048*x)),q=M/(1+Math.exp(q))|1,b=t%M<q,t=(b?q:M-q)*(t>>17)+t%M-!b*q;t<1<<21;r=r?y&127:-1)y=~r?r:s.charCodeAt(i++),t=t<<7|([,13,36,92,96][r=y>>7]||y)
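For reference, the recursive bigram expansion (the e function above) de-golfs to roughly the following; the helper name is mine, and a is the word buffer from the decoder:

function expand(d){                         // d: one decoded 6-bit symbol
  let x=`BIGRAMS`.charCodeAt(d-50);         // NaN (falsy) unless d >= 50 selects a bigram entry
  if(x){expand(x>>6);expand(x%64);}         // one bigram character packs two further 6-bit symbols
  else if(d<24){console.log(a);a=a.slice(0,a.length-d);} // print the word, then drop its last d characters
  else a+=String.fromCharCode(73+d);        // 24..49 append 'a'..'z'
}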
Tried to make use of more context models (sparse contexts (1), (1,2) and (2,3)); adding a new one costs about 10–15 bytes, so it was a good tradeoff. It now uses 7-bit codes instead of 6-bit codes (for more bigrams), so the memory usage saw a tenfold increase. Also tweaked the decompressor to save dozens of bytes.
r=-1;t=INITIALSTATE;M=2<<16;A=Array;m=[...'12233333'].map(i=>[A(i=1<<7*i).fill(M/4),[0],A(i).fill(1)]);z=[a=""];e=d=>(x=`BIGRAMS`[K](d-50))?e(x%128,e(x>>7)):d<24?a=a.slice(console.log(a),a.length-d):a+=String.fromCharCode(73+d);for(i=k=0;k<DECODEDLEN;e(z[k++]=v-128))for(v=1;v<128;m.map(([p,w,c],i)=>(y=p[j=u[i]<<7|v]+=(b*M/2-p[j]<<14)/(c[j]+=2*(c[j]<126))>>13,w[0]+=x[i]*(b-q/M))),v=v*2+b)for(B=z[k-3],C=z[k-4],u=[q=0,u=z[k-1],A=z[k-2],A<<7|u,B<<7|u,C<<7|u,B<<7|A,C<<7|B],x=m.map(([p,[w]],i)=>(x=p[u[i]<<7|v]*2+1,x=Math.log(x/(M-x)),q-=w*x,x/1024)),q=M/(1+Math.exp(q))|1,b=t%M<q,t=(b?q:M-q)*(t>>17)+t%M-!b*q;t<M<<4;r=r?y&127:-1)y=~r?r:`ENCODED`[K='charCodeAt'](i++),t=t<<7|([,13,36,92,96][r=y>>7]||y)
A 5/6-bit packer with delta coding of the shared prefix and additional bigram coding. Bigram lists are conveniently stored before the actual bitstream, so the bit-decoding routine is shared (the two are distinguished by the bit width: 5 bits for a bigram entry and 6 bits for a code).
s=`ENCODED`;w="";b=1;v=5;m=[];p=d=>d>3?w+=String.fromCharCode(93+d):w=w.substr(console.log(w),d);for(i=0;s[i];++i)for(c=s.charCodeAt(i),b=b<<(c>>7?10:7)|c&1023;b>>v;v=m[67]?6:5)d=b&63,b>>=v,v-5?p(d<30?d:p(m[d-=30])&&m[d+34]):m.push(d&31)
A carefully designed Huffman decoder. The tree is stored in a compressed format (lengths only, with bigrams built in) and rebuilt in ~60 bytes of code (described below). The delta coding is still used.
s=`ENCODED`;z=`CODES`;k={};b=u=9e3;for(i=0;z[i];c?k[b++]=c:b*=2)c=z.charCodeAt(i++);w="";for(i=0;s[i];)for(c=s.charCodeAt(i++)^128;c>1;c>>=1)for(v=k[u=u*2+c%2];v|=0;u=9e3)d=v%32,v/=32,d>5?w+=String.fromCharCode(91+d):w=w.substr(console.log(w),d-1)
Incremental improvement over conversation. Features a much smaller Huffman decoder (again described below).
k=[C="charCodeAt"];b=u=9e3;for(c of`CODES`)b+=(k[b]=c[C]())?1:b;w="";for(c of`ENCODED`)for(c=c[C]()^128;c>1;c>>=1)for(v=k[u=u*2+c%2];v|=0;u=9e3)d=v%32,v/=32,d>5?w+=String.fromCharCode(91+d):w=w.substr(console.log(w),d-1)
Incremental improvement over practically, mostly shared with venusaur due to the use of the common Huffman decoder.
k=[b=u=9e3];for(c of`CODES`)b+=(k[b]=c[C="charCodeAt"](w=""))?1:b;for(c of`ENCODED`)for(c=c[C](),c=c>>7?c<<5|[13,4,28][c&3]:c^128;c>1;c>>=1)for(v=k[u=u*2+c%2];v|=0;u=9e3)d=v%32,v/=32,d>5?w+=String.fromCharCode(91+d):w=w.slice(console.log(w),d-1)
Since the verification environment always expects UTF-8-encoded source code, and since the code size is measured in bytes (not "code points"), it is hard to make use of all 8 bits in a byte; we therefore generally aim for 7 bits per byte, and refer to this unit as a "byte" hereafter.
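For example, measuring with TextEncoder (assuming this matches how the environment counts bytes):

new TextEncoder().encode("e").length  // 1: a code point below U+0080 costs one byte
new TextEncoder().encode("é").length  // 2: anything above costs at least two bytes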
Thanks to ES6 template literals we can put in many control characters that are invalid in normal literals, but we still need to escape the following characters: \r (which will be mangled to \n), $, ` and \.
Here's where I took inspiration from base-122 encoding. In short, a two-byte UTF-8 sequence has almost 11 bits to spare, so when an invalid byte appears you can merge it with the next byte and store both as a single code point without any escape sequence. Since this takes only 2 + 7 = 9 bits, depending on the application you can actually pack in more bits whenever an invalid byte appears (!). This forms the basis of virtually all my submissions.
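Here is a minimal sketch of the merging idea, assuming a flat stream of 7-bit groups (this is not the real encode_bq_v1, and the helper names are mine):

INVALID=[13,36,92,96];              // \r $ \ `: the bytes that cannot appear raw in a template literal
function pack(groups){              // groups: the bitstream chopped into 7-bit values
  let out="";
  for(let i=0;i<groups.length;i++){
    let j=INVALID.indexOf(groups[i]);
    if(j<0)out+=String.fromCharCode(groups[i]);              // safe: one byte carries 7 bits
    else out+=String.fromCharCode((j+1)<<7|(groups[++i]|0)); // unsafe: fold the next group in, so one
                                                             // two-byte code point carries 14 bits
  }
  return out;
}
function unpack(s){                 // the reverse mapping, matching the decoders shown above
  let groups=[];
  for(let c of s){
    let y=c.charCodeAt();
    if(y<128)groups.push(y);
    else groups.push(INVALID[(y>>7)-1],y&127);
  }
  return groups;
}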
It should be noted that you should be careful when integrating this bit packer with a bit decoder, as it alters the decoding order. There are multiple strategies:
For example, a two-bit value xx followed by an 8-bit byte yyyyyyyy can be encoded as 000001yy yyyyyyxx and expanded as yyyyyyyy XXXXXXXX, because bits are read from the least-significant bit; see encode_bq_v1 for the details.

It is a well-known fact that Huffman trees can be described by the lengths of each code alone, provided that the trees are skewed towards the ones (a shorter code is always lexicographically placed before a longer code). This is famously used in the DEFLATE and gzip formats. Since practically, the tree has been encoded as a zero-separated string of the corresponding codes (chosen so that they are never zero).
That was the external representation. How about the internal representation? Instead of chasing pointers, we can represent the leaf nodes as an associative array mapping each bit sequence to its code. If a lookup fails, read one more bit and retry. This could be done with string keys, but they are annoying to reconstruct; you would need .toString(2) with padding, which JavaScript doesn't readily provide.
Instead of string keys, we use integer keys directly representing the bit sequence. But then how can you distinguish, say, 01 from 1? Therefore we make use of non-zero guard bits placed above the actual bits: with guard bits you can easily distinguish GUARD01 from GUARD1. The guard bits should be longer than the longest code, but can otherwise be chosen freely (for example, ivysaur uses 2e4 because there are 16-bit codes). This approach equally applies to the bit reader itself, where the guard can be as short as a single 1 bit.
Combining the tree reconstruction algorithm and the guard bits, we arrive at the following compact code:
// TREE RECONSTRUCTION
t={};          // tree mapping
b=GUARD;       // next bits to put
for(c of`CODES`)
  b+=(t[b]=c.charCodeAt())?1 // associate b to the code, and increase the code
    :b;                      // or double b, effectively putting 0 to the end

// DECODER
b=GUARD;       // currently read bits
for(c=BYTE;    // the topmost bit (not a part of data) should be set
    c>1;       // if only the guard bit is remaining, read one more byte
    c>>=1)     // read from the least significant to the most significant
  (v=t[b=b*2+c%2])&&  // v is the code or undefined;
                      // we can distinguish undefined by having non-zero codes
  (b=GUARD,...)       // we reset read bits and use v
(In reality, this code is interleaved with the bigram decoder. I also used a clever preprocessor to put the guard bit into the read byte without even checking whether it is a one-byte or two-byte sequence; see adjust_bq_for_bitstream.)
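As a concrete illustration of the reconstruction loop, take a hypothetical code set (not from any actual submission) a=00, b=01, c=100, d=101, e=110 and a guard of 16:

t={};b=16;                                      // guard 16 = 0b10000, enough for codes up to 4 bits
for(c of"\0\0ab\0cde")b+=(t[b]=c.charCodeAt())?1:b;
// every NUL doubles b (descends one level), every other character claims the next slot:
// t[64]=97 ('a' at bits 00), t[65]=98 ('b' at 01), t[132]=99 ('c' at 100), t[133]=100, t[134]=101;
// the zero entries it also writes are harmless because only non-zero codes are accepted

The decoder then starts from the same guard, computes u=u*2+bit for every bit it reads, and resets u whenever t[u] turns out to be non-zero.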
I won't explain rANS and context mixing themselves here. See Wikipedia and Matt Mahoney's excellent book. For rANS, Fabian Giesen's simple implementation is fairly readable.
Since rANS and the context model are inherently simple, there is not much to cut off. It should be noted that the bit probability reported by the context model must be clamped strictly between 0 and 1 (otherwise rANS will divide by zero). Therefore I represented the probability as a fixed-point number with the lowest bit always set whenever possible. This is not that useful for the normal context model, but greatly useful for the logistic mixing. The same technique also applies to the counts (unlike the probability, this is commonly done in performant encoders).
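As a readable sketch of what that means in the decoder's inner loop (the constants mirror the code above, but the structure, names and the next7 callback are mine):

M=1<<17;                                        // fixed-point total; probabilities live strictly inside (0, M)
function decodeBit(st,p,cnt,ctx,next7){         // st.t: rANS state, next7(): pulls 7 more bits
  let q=p[ctx]*2+1;                             // probability of a 1 bit; the low bit keeps it away from 0 and M
  let slot=st.t%M,bit=slot<q?1:0;
  st.t=(bit?q:M-q)*Math.floor(st.t/M)+slot-(bit?0:q); // rANS step: freq(bit)*high + (slot - start(bit))
  while(st.t<M*16)st.t=st.t*128+next7();        // renormalize, 7 bits at a time
  cnt[ctx]+=cnt[ctx]<126?2:0;                   // saturating hit count
  p[ctx]+=Math.floor((bit*M/2-p[ctx])*2/cnt[ctx]); // nudge the probability toward the observed bit
  return bit;
}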
There are many possible approaches to the modeling, and I have yet to explore many of them (for example, I haven't even started with higher-order context models because hashing the context is nontrivial). I have at least tried entropy-reducing transforms (namely BWT and MTF) and found that they actually perform worse with the context mixing.