Clean your codebase with basic information theory

Cut out everything that’s not surprising.

– Derek Sivers

Surprise

The following equation measures
“surprise”:

It’s easy to estimate the surprise of text files:

const rx = /[s,][()]+/g;
const counts = {};
for (const file of Deno.args)
  for (const x of (await Deno.readTextFile(file)).split(rx))
    counts[x] = 1 + counts[x] || 1;
const total = Object.values(counts).reduce((a, b) => a + b, 0);
let entropy = 0;
for (const c of Object.values(counts)) {
  const p = c / total;
  entropy -= p * Math.log2(p);
}
console.log(entropy);

Note: this script calculates surprise at the level of words. You can do
similar estimates at the character level, but it’s not as useful for improving
readability.

Running the script on a few random markdown files:

$ ./entropy src/enough.md
6.371408749206546

$ ./entropy src/two-toucans-canoe.md
6.452173798618295

$ ./entropy src/enough.md src/two-toucans-canoe.md
7.1796868604897295

$ ./entropy src/music.md
8.172144309119338

$ ./entropy src/pardoned.md
8.180170381663272

/enough, /two-toucans-canoe, /music,
/pardoned

If you look at the examples, you’ll see that entropy generally increases with
file size, excluding files with a lot of repetition, e.g.
my music ratings.

Compression

To compress text at the word-level, substitute repeated phrases with shorter
phrases, e.g. replace “for the purpose of” with “for”. Note that replacing
“gargantuan” with “big” may improve surprise at the character-level, but not
necessarily at the word-level.

This is a consequence of
Shannon’s source coding theorem.
The theorem also provides calculable upper-bounds for text compression.

In theory, you could use something like
Huffman coding to shrink the
size of your code.
Minifiers use
related strategies to produce unreadable-yet-valid gibberish.

You can also
estimate upper-bounds for compressed programs.
Implement something like gzip in a target language and extract the program
with its decompressor to a new file. The number of bits in the
self-extracting archive
approximates the size of the smallest possible program.

In practice, compressing programs for humans means replacing larger snippets
with smaller functions, variables, etc.

Here’s an example from a real-world Elm file with 2.2k LOC:

$ ./entropy BigProject.elm
8.702704507769093

# `Css.hex "#ddd"` -> `ddd`
$ sed -i 's/Css.hex "#ddd"/"ddd"/g' BigProject.elm

# less repetition creates more surprise
$ ./entropy BigProject.elm
8.706398741147163

How would this number change over the life of a project? I’d like a script to
compare the total entropy of all files in a repo at each release and throw it in
a chart. Please email me at [email protected] if you
have any implementation suggestions.

To find compression candidates, look for frequent sequences of words, i.e.
code collocates.

Although it’s possible to automate long-sequence identification, I prefer to
list frequent words and then compress in conceptual chunks. This script
calculates frequencies at the word-level, but also weights words based on their
character length:

const rx = /[s,][()]+/g;
const len = {};
for (const file of Deno.args)
  for (const x of (await Deno.readTextFile(file)).split(rx))
    len[x] = x.length + len[x] || x.length;
console.log(
  Object.entries(len)
    .filter(a => a[1] >= 10 && a[0].length > 2)
    .sort((a, b) => b[1] - a[1])
    .map(a => `${a[1]} ${a[0]}`)
    .join("n")
);

Running the script on my Elm project:

$ ./freq-lengths BigProject.elm
1467 Attrs.css
1295 Css.rem
732 Css.fontSize
616 Html.div
594 Html.text
570 Css.backgroundColor
455 Css.hex
391 Maybe.withDefault
360 import
354 Css.px
336 Css.fontWeight
322 Css.lineHeight
304 Css.borderRadius
297 Css.color
297 Html.span
273 Css.marginTop
266 Nothing
252 String
231 Css.int
225 Css.solid

Based on these results, my first instinct is to pull out some of the CSS stuff
into helper functions, especially for font-related styling. C’est la vie.

Readability

Code golf is entertaining, but my boss
would be pissed to find a program like
“t,ȧṫÞċḅ»Ḳ“¡¥Ɓc’ṃs4K€Y in our
repo.

While
readability formulas exist for text,
“readable code” remains controversial.

In my experience, the key to maintaining readability is developing a healthy
respect for locality:

coarsely structure codebases around CPU timelines and dataflow
don’t pollute your namespace – use blocks to restrict
variables/functions to the smallest possible scope
group related concepts together

The hardest part of this process is deciding what “related concepts” mean. For
example, should you group all CSS in one file, or should you group styles next
to their HTML structures? Should you group all your database queries together,
or inline them where they’re used?

I don’t have all the answers, but it’s abundantly clear that we have a
thirty million line problem and
that
complex ideas fit on t-shirts.
When I see
an entire compositor and graphics engine in 435 LOC,
I expect many surprises from the future of coding.

Clean your codebase with basic information theory

Surprise

Compression

Readability

AI solves International Math Olympiad problems at silver medal level

Jacek Karpińśki, the computer genius the communists couldn’t stand (2017)

Reverse Engineering for Everyone

LEAVE A REPLY Cancel reply

Most Popular

Facebook doesn’t think hackers accessed third-party sites

It’s getting a lot harder for global brands to win in China

Why it’s time for investors to go on the defense

Facebook doesn’t think hackers accessed third-party sites

Recent Comments

EDITOR PICKS

Top Fashion Trends to Look for in Every Important Collection

Spring Fashion Show at the University of Michigan Has Started

Top Ten Kitchen Shortcuts for Indian Food Delights

POPULAR POSTS

Reflecting on 18 Years at Google

Gboard Hat Version

Feathered robotic wing paves way for flapping drones

POPULAR CATEGORY